DGGAN: Depth-image Guided Generative Adversarial Networks for Disentangling RGB and Depth Images in 3D Hand Pose Estimation
Liangjian Chen, Shih-Yao Lin, Yusheng Xie*, Yen-Yu Lin, Wei Fan, and Xiaohui Xie
University of California, Irvine; Tencent America; Amazon; National Chiao Tung University
{liangjc2,xhx}@ics.uci.edu, {shihyaolin,davidwfan}@tencent.com, [email protected], [email protected]
* Work done prior to joining Amazon.

Abstract
Estimating 3D hand poses from RGB images is essential to a wide range of potential applications, but is challenging owing to substantial ambiguity in the inference of depth information from RGB images. State-of-the-art estimators address this problem by regularizing 3D hand pose estimation models during training to enforce the consistency between the predicted 3D poses and the ground-truth depth maps. However, these estimators rely on both RGB images and the paired depth maps during training. In this study, we propose a conditional generative adversarial network (GAN) model, called Depth-image Guided GAN (DGGAN), to generate realistic depth maps conditioned on the input RGB image, and use the synthesized depth maps to regularize the 3D hand pose estimation model, therefore eliminating the need for ground-truth depth maps. Experimental results on multiple benchmark datasets show that the synthesized depth maps produced by DGGAN are quite effective in regularizing the pose estimation model, yielding new state-of-the-art results in estimation accuracy, notably reducing the mean 3D end-point error (EPE) on the RHD, STB, and MHP datasets.
1. Introduction
Vision-based 3D hand pose estimation (3D HPE) aims to estimate the 3D keypoint coordinates of a given hand image. 3D HPE has drawn increasing attention owing to its wide applications to human-computer interaction (HCI) [1, 21], sign language understanding [34], augmented/virtual reality (AR/VR) [22, 15], and robotics [1]. RGB images and depth maps are the two most commonly used types of input data for the 3D HPE task. An example of a hand image and its corresponding depth map is shown in Figure 1(a).
Figure 1. Training examples in a generic 3D HPE dataset: (a) paired RGB and depth images; (b) unpaired RGB and depth images. Our work does not rely on paired training data and therefore is applicable to both RGB-only and depth-only 3D HPE tasks.

A depth map provides 3D information related to the distance of the surface of human hands. Training networks with depth maps has been proven to achieve significant progress on the 3D HPE task [4, 16]. In addition, with the depth information provided by depth maps, the hand segmentation task can be effectively solved. Unfortunately, capturing depth maps often requires specific sensors (e.g., Microsoft Kinect, RealSense), which limits the usability of those state-of-the-art methods based on depth maps. Commercial depth sensors are usually much more expensive than RGB cameras. On the other hand, RGB images are the most commonly used input data in the HPE task because they can be easily captured by abundant low-cost optical sensors such as webcams and smartphones. However, 3D HPE from RGB images is a challenging task.

In the absence of depth information, estimating the 3D hand pose from a monocular RGB image is intrinsically an ill-posed problem. To address this issue, state-of-the-art methods such as [4, 10] leverage both RGB hand images and their paired depth maps for the 3D HPE task. Their 3D hand pose inference process takes an RGB image and the paired depth information into account. They first regress 3D hand poses on RGB images, and then utilize a separate branch to regularize the predicted 3D hand pose by using the paired depth maps. The objective of the depth regularizer is to make the predicted 3D keypoint positions consistent with the provided depth map. This results in two major advantages: 1) training networks with depth maps can efficiently improve the hand pose estimator by using the depth information to reduce ambiguity, and 2) it enables 3D HPE based on merely RGB images during the inference stage. These approaches require paired RGB and depth training images. Unfortunately, most existing hand pose datasets only contain either depth maps or RGB images, instead of both, which makes the aforementioned approaches inapplicable to such datasets. Besides, unpaired RGB and depth training images cannot be exploited by these methods. Figure 1(b) shows an example of unpaired RGB and depth map images. To tackle this problem, we propose a novel generative adversarial network, called
Depth-image Guided GAN (DGGAN). Our network contains two modules: depth-map reconstruction and hand pose estimation. The main idea of our approach is to directly reconstruct the depth map from an input RGB hand image in the absence of paired RGB and depth training images. Given an RGB image, our depth-map reconstruction module aims to infer its depth map. Our hand pose estimation module takes RGB and depth information into account to infer the 3D hand pose. In the hand pose estimation module, we infer the 2D hand keypoints on the input RGB image, and regress the 3D hand pose by using the inferred 2D keypoints. The depth map is then used to regularize the inferred 3D hand pose. Unlike most existing 3D HPE models, the real depth maps used to train our DGGAN model do not require any paired RGB images. Once DGGAN is learned, the proposed HPE module directly infers the hand pose from an RGB image, guided (regularized) by a DGGAN-inferred depth map. Since the depth map can be inferred by our depth-map reconstruction module, the proposed DGGAN no longer requires paired RGB and depth images. Our DGGAN jointly trains the two modules in an end-to-end trainable network architecture. Experimental results on multiple benchmark datasets demonstrate that our DGGAN not only reconstructs the depth map of an input RGB image, but also significantly improves the 3D hand pose estimator via an additional depth regularizer.

The main contributions of this study are summarized as follows:

1. We propose a depth-map guided generative adversarial network (DGGAN) for 3D hand pose estimation from RGB images. Our network can jointly infer the depth information from input RGB images and estimate the 3D hand poses.

2. We introduce a depth-map reconstruction module to infer the depth maps from input RGB images while learning to predict 3D hand poses. Our DGGAN is trained on readily accessible hand depth maps that are not paired with RGB images.

3. Experimental results demonstrate that our approach achieves new state-of-the-art 3D hand pose prediction accuracy on three benchmark datasets, including the RHD, STB, and MHP datasets.
2. Related Work
Research topics related to this work are discussed below.

3D HPE from depth maps has been extensively studied. Existing approaches in this field make noticeable advances [29, 33, 8, 31, 9, 11, 20]. Wan et al. [29] propose a dense regression approach to fit the parameters of a deformed hand model. Ge et al. [9, 11] adopt PointNet [24] to extract hand features and regress hand joint locations by referring to the extracted features. Wu et al. [31] adopt intermediate dense guidance map supervision to generate hand heatmaps. Although the existing methods achieve very accurate estimation results, they typically rely on hand data captured by high-precision depth sensors, which are still expensive in practice and usually require data collection in a lab environment. Different from the models in the aforementioned methods, our model performs inference on RGB data without the need of depth maps.
Due to the wide availability of RGB cameras, 3D HPE from monocular RGB images is becoming increasingly popular in computer vision applications. Many recent methods aim at estimating hand joint locations directly from a single RGB image [4, 16, 10, 38, 22, 6, 32, 3, 36, 28]. Zimmermann et al. [38] use 2D convolutional neural networks (CNNs) to extract features from an RGB image, and regress the 3D hand joint locations. However, their method suffers from depth ambiguity due to the absence of depth information. Building upon the work by Zimmermann et al., Iqbal et al. [16] and Cai et al. [4] inherit and adopt a similar 2D CNN architecture for extracting image features. Iqbal et al. use depth maps as intermediate guidance while Cai et al. treat depth maps as a regularizer in a weakly supervised manner. Though these two methods make substantial progress in terms of estimation accuracy, there currently exist few datasets that fulfill their requirement of paired depth maps and RGB images. Ge et al. [10] take one step further by predicting the hand mesh from an RGB image and then the 3D hand joint locations based on the mesh. However, their method requires paired mesh information, which is even rarer among all existing datasets. Compared with these methods, our method also uses depth information during training, but it does not require any paired RGB images and depth maps. Thus, it is much more flexible since it can consume RGB images and depth maps from different datasets or sources.

Figure 2. Overview of the proposed DGGAN. DGGAN consists of two modules, a depth-map reconstruction module shown in Figure 3 and a hand pose estimation module shown in Figure 4. The former module, trained using the GAN loss, aims at inferring the depth map of a hand based on the input RGB image and making the generated depth map look realistic. The latter module, trained using the task loss, estimates hand poses from the input RGB and the GAN-reconstructed depth images.
To further enhance 3D HPE [2, 3, 10, 18], hand mesh estimation can be included. Namely, the model estimates not only the hand joints but also the hand surface mesh. However, these methods, such as [10], have a common drawback: they require additional mesh annotations, which are even more expensive to obtain than joint locations. Thus, they are typically trained on synthetic datasets due to this limitation. Baek et al. [2] introduce an iterative learning method to refine mesh shapes and achieve very good performance. However, like 3D hand joint locations, hand meshes highly rely on additional supervision from hand segmentation maps, which are typically not available in current hand pose datasets. The method by Boukhayma et al. [3] is the only extra-data-free method, but its performance is limited.
Generating images using generative adversarial networks (GANs) [13] has made remarkable progress. Many approaches explore how to better manipulate images by applying GAN models [14, 17, 37, 7]. Isola et al. [17] propose the Pix2Pix network, which translates label or edge maps to synthesized photos, reconstructs objects from edge maps, or colorizes images. Zhu et al. [37] introduce the cycle-consistent generative adversarial network (CycleGAN). CycleGAN uses the cycle consistency loss to disentangle the input and output pair and therefore does not need paired input. Hoffman et al. [14] propose cycle-consistent adversarial domain adaptation (CyCADA). Compared to CycleGAN, CyCADA contains a segmentation loss. As a result, CyCADA not only translates images from one modality to another but also deals with a specific visual task.

Applying generative adversarial models to RGB hand images for hand pose estimation is also gaining popularity. Mueller et al. [22] introduce the geometry-consistent GAN (GeoConGAN) to generate synthetic image data for training. Chen et al. [6] propose the tonality-alignment generative adversarial networks (TAGAN) for producing more realistic images from synthetic images for hand pose estimator training. However, these methods only focus on generating RGB images. None of them generates depth maps for assisting hand pose estimator training.
3. Our Approach
Our goal is to estimate the 3D hand pose from a monocular RGB hand image. Although existing state-of-the-art methods [3, 25, 33] have shown that training networks with RGB and depth images can improve 3D hand pose estimators, few 3D hand pose datasets consist of paired RGB and depth images. To deal with this lack of paired data, we propose a novel adversarial neural network, called the depth-map guided generative adversarial network (DGGAN), illustrated in Figure 2, which can jointly learn to infer the depth map from an RGB image of a hand and to estimate the 3D hand pose. In the following, we give an overview of the proposed DGGAN and describe its two major modules in detail.
The proposed DGGAN consists of two major modules, a depth-map reconstruction module and a hand pose estimation module. Its network architecture is shown in Figure 2. Given an RGB hand image $I$, we want to estimate the $K$ 3D hand joint locations $J^{xyz} \in \mathbb{R}^{3 \times K}$. Each column of the $3 \times K$ matrix is a vector of size 3 and represents the $(x, y, z)$ coordinates of a joint, i.e., $J^{xyz} = [J_1^{xyz}, J_2^{xyz}, \ldots, J_K^{xyz}]$. The two modules in the proposed DGGAN $G$ are trained by using the GAN loss $L_{GAN}$ and the task loss $L_{task}$, respectively. The objective of learning $G$ is formulated as a min-max game:

$$G^* = \arg\min_G \max_D \left( \lambda_t L_{task} + \lambda_g L_{GAN} \right), \quad (1)$$

where $\lambda_t$ and $\lambda_g$ control the relative importance of these two loss terms.

Figure 3. Network architecture of the depth-map reconstruction module.

Given an RGB hand image, our depth-map reconstruction module tries to generate its corresponding depth map. A set of unpaired training depth images is adopted to train the depth-map reconstruction module so that its inferred depth maps are similar to real ones. To achieve this, the discriminator in this module works on distinguishing real depth maps from fake (generated) ones. Section 3.2 describes the details of depth-map reconstruction. The depth map inferred from the depth-map reconstruction module, together with the input RGB image, is fed to the hand pose estimation module for estimating the 3D hand pose. In the hand pose estimation module, the input RGB image is used to regress the 3D hand pose. The inferred depth map is adopted to regularize the predicted 3D hand pose. The loss for hand pose estimation $L_{task}$ is adopted for optimization. Section 3.3 describes the details.

The depth-map reconstruction module aims at relaxing the requirement of paired RGB and depth images during training. This module is constructed via an adversarial network that infers the depth map according to an input RGB image. Figure 3 shows the network architecture of this module. In the training phase, our network requires both depth and RGB training images. Nevertheless, the RGB and depth images do not need to be paired. We consider the process of inferring a depth map from its corresponding RGB image as an unsupervised adaptation problem, where the RGB modality $S$ and depth modality $T$ are both provided. We are given a set of RGB images $X_S$ and a set of real depth maps $X_T$. To translate from $S$ to $T$, we adopt an encoder-decoder architecture $G_{S \to T}$. The generator $G_{S \to T}$ is trained to generate a realistic depth map to fool the discriminator $D$, while $D$ is trained to distinguish the real data $x_t$ from the generated fake data $G_{S \to T}(x_s)$. The loss for the depth-map reconstruction module is as follows:

$$L_{GAN}(G_{S \to T}, D, X_S, X_T) = \mathbb{E}_{x_t \sim X_T}[\log D(x_t)] + \mathbb{E}_{x_s \sim X_S}[\log(1 - D(G_{S \to T}(x_s)))]. \quad (2)$$

This loss also provides semantic constraints to force the generator to produce more realistic depth maps. By taking unpaired RGB and depth images as input, our depth-map reconstruction module becomes applicable to vastly more hand pose datasets. Furthermore, we can train the network with a large amount of unpaired RGB and depth images.

Given an inferred depth map computed by the depth-map reconstruction module, we combine it with the input RGB image and feed both to the hand pose estimation module. The network architecture of the hand pose estimation module is shown in Figure 4. The hand pose estimation module calculates the task loss $L_{task}$, which is composed of two terms: $L_{task} = L_{2D} + L_z$.
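To make the adversarial objective of Eqs. (1)-(2) concrete, the following is a minimal PyTorch sketch; it assumes `generator` and `discriminator` are module stand-ins for $G_{S \to T}$ and $D$ that output logits, which is our reading for illustration rather than a detail specified by the architecture above.

```python
import torch
import torch.nn.functional as F

def gan_losses(generator, discriminator, rgb_batch, real_depth_batch):
    """Adversarial losses of Eq. (2) for one batch of unpaired RGB/depth data."""
    fake_depth = generator(rgb_batch)  # G_{S->T}(x_s)

    # Discriminator: push D(x_t) toward 1 and D(G(x_s)) toward 0.
    d_real = discriminator(real_depth_batch)
    d_fake = discriminator(fake_depth.detach())  # no gradient into G here
    d_loss = F.binary_cross_entropy_with_logits(d_real, torch.ones_like(d_real)) \
           + F.binary_cross_entropy_with_logits(d_fake, torch.zeros_like(d_fake))

    # Generator: fool D (non-saturating form of log(1 - D(G(x_s)))).
    g_fake = discriminator(fake_depth)
    g_loss = F.binary_cross_entropy_with_logits(g_fake, torch.ones_like(g_fake))
    return d_loss, g_loss, fake_depth
```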
The 2D hand regression loss $L_{2D}$ and the depth regularization loss $L_z$ are described in Sections 3.3.1 and 3.3.2, respectively.

Previous studies [4] show that depth information can be used to build a powerful regularizer. We leverage the depth regularizer for improving the result of 3D HPE. Unlike most previous works, where the ground-truth depth maps are needed, our model uses a synthetic depth map generated by the depth-map reconstruction module. Our experimental results show that training with such synthetic depth maps substantially helps improve the result of direct regression.

3D hand pose regression takes an RGB image and an inferred depth map as input and outputs joint locations in two steps. In the first step, we adopt a popular variant of the CPM architecture [5, 30] as the 2D joint location predictor. This predictor consists of six stages. Each stage contains seven convolutional layers, each followed by a Rectified Linear Unit (ReLU). At each stage $s$, it predicts $K$ heatmaps $\{H_s^k\}_{k=1}^{K}$ for the $K$ different hand joints. The pixel value in the $k$-th heatmap at stage $s$, $H_s^k$, indicates the confidence that the $k$-th joint is located at this position. Following the convention [30], the ground-truth heatmaps are denoted as $\{H^{k*}\}_{k=1}^{K}$. Each $H^{k*}$ is the Gaussian blur of the Dirac-$\delta$ distribution centered at the ground-truth location of the $k$-th joint. We train this part of the hand pose module by standard backpropagation with the mean square error (MSE) loss. In addition to the MSE loss, we add intermediate supervision for each stage. The final loss for 2D location prediction is

$$L_{2D} = \frac{1}{6K} \sum_{s=1}^{6} \sum_{k=1}^{K} \| H_s^k - H^{k*} \|_F . \quad (3)$$

Figure 4. Architecture of the hand pose estimation module. This module takes paired RGB images and inferred depth maps as inputs. A 2D CPM consumes an RGB image as input and produces the hand joint heatmaps. The joint heatmaps are fed to the regression network to estimate the 3D joint locations with the aid of a depth regularizer. The depth regularizer reconstructs the depth map from 3D joint locations and is trained using the L1 loss and the GAN-synthesized depth map as guidance.
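A sketch of the intermediately supervised heatmap loss of Eq. (3), again assuming PyTorch; the (batch, K, H, W) tensor layout and the additional averaging over the batch are illustration-level assumptions.

```python
import torch

def heatmap_loss(pred_heatmaps, gt_heatmaps):
    """Eq. (3): heatmap loss averaged over the 6 CPM stages and K joints.

    pred_heatmaps: list of 6 tensors, each of shape (batch, K, H, W),
                   one per CPM stage (intermediate supervision).
    gt_heatmaps:   tensor of shape (batch, K, H, W) holding the
                   Gaussian-blurred ground-truth joint heatmaps.
    """
    num_stages = len(pred_heatmaps)    # 6 in the paper
    num_joints = gt_heatmaps.shape[1]  # K
    loss = 0.0
    for stage_pred in pred_heatmaps:
        # Frobenius norm of the per-joint residual heatmaps.
        diff = stage_pred - gt_heatmaps
        loss = loss + diff.flatten(2).norm(dim=2).sum()
    # Normalize by stages, joints, and (as an assumption) the batch size.
    return loss / (num_stages * num_joints * gt_heatmaps.shape[0])
```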
In the second step, the regression network takes the heatmaps from the CPM as input, and outputs the relative depth. Its architecture is a mini-CPM (one stage instead of six) followed by three fully connected layers. $Z \in \mathbb{R}^{K \times 1}$ denotes the relative depth of each hand joint. We employ the smooth L1 loss between $Z$ and the ground truth $Z^*$. The loss of depth regression $L_z$ is as follows:
$$L_z = \frac{1}{K} \sum_{k=1}^{K} \begin{cases} \frac{1}{2} (Z_k - Z_k^*)^2, & \text{if } |Z_k - Z_k^*| \le 1, \\ |Z_k - Z_k^*| - \frac{1}{2}, & \text{otherwise.} \end{cases} \quad (4)$$

To provide supervision on every pixel of a depth map, we employ the depth regularizer (DR) proposed in [4]. The depth regularizer takes the relative depth as input and predicts a relative depth map $D$. It reshapes $Z \in \mathbb{R}^{K \times 1}$ to a $K \times 1 \times 1$ tensor, which is considered as a $K$-channel image input. We then up-sample this image from $K$ channels at resolution $1 \times 1$ to one channel at the original depth-map resolution ($n \times m$) through six layers of transposed convolutions. We take the L1 norm between $D$ and the ground-truth relative depth map $D^*$ as the depth regularizer loss $L_{dep}$, i.e.,

$$L_{dep} = \| D - D^* \|_1 , \quad (5)$$

where $D^*$ is obtained from the input depth map $\hat{D}^*$ as follows:

$$D^* = \frac{\hat{D}^* - \min \hat{D}^*}{\max \hat{D}^* - \min \hat{D}^*} . \quad (6)$$

Note that we only use the ground-truth depth map $\hat{D}^*$ during the initialization stage. It is replaced by DGGAN-generated depth maps once the initialization stage ends.
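Eqs. (4)-(6) map directly onto standard operations; below is a sketch assuming PyTorch, where F.smooth_l1_loss (with its default threshold of 1) matches the piecewise form of Eq. (4). The small epsilon in the normalization is our addition to guard against flat depth maps, not something stated above.

```python
import torch
import torch.nn.functional as F

def depth_regression_loss(pred_z, gt_z):
    """Eq. (4): smooth L1 loss over the K relative joint depths."""
    return F.smooth_l1_loss(pred_z, gt_z, reduction="mean")

def normalize_depth_map(raw_depth):
    """Eq. (6): rescale a raw depth map to the [0, 1] relative range."""
    d_min, d_max = raw_depth.min(), raw_depth.max()
    return (raw_depth - d_min) / (d_max - d_min + 1e-8)  # eps guards flat maps

def depth_regularizer_loss(pred_depth_map, guide_depth_map):
    """Eq. (5): L1 norm between the predicted and guidance relative depth maps."""
    return (pred_depth_map - normalize_depth_map(guide_depth_map)).abs().sum()
```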
Figure 5. Some examples of the three benchmark datasets used for evaluation. Top row: the RHD dataset [38] provides synthetic hand images with 3D hand keypoint annotations. Middle row: the STB dataset [35] contains real hand images with 3D keypoints. Bottom row: the MHP dataset [12] offers real hand images with 3D keypoints.
Combining the loss terms described in Sections 3.3.1 and 3.3.2, we summarize the loss function for the hand pose estimation module as

$$L_{task} = \lambda_z L_z + \lambda_{2D} L_{2D} + \lambda_{dep} L_{dep} , \quad (7)$$

where $\lambda_z$, $\lambda_{2D}$, and $\lambda_{dep}$ control the importance of the three loss terms, respectively.
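Putting Eq. (7) together, a sketch assuming the helper functions from the earlier sketches; the default weight values are placeholders, as the $\lambda$ settings are not reported here.

```python
def task_loss(pred_heatmaps, gt_heatmaps, pred_z, gt_z,
              pred_depth_map, guide_depth_map,
              lambda_z=1.0, lambda_2d=1.0, lambda_dep=1.0):
    """Eq. (7): weighted sum of the 2D heatmap, depth-regression, and
    depth-regularizer losses. The lambda defaults are placeholders."""
    return (lambda_2d * heatmap_loss(pred_heatmaps, gt_heatmaps)
            + lambda_z * depth_regression_loss(pred_z, gt_z)
            + lambda_dep * depth_regularizer_loss(pred_depth_map, guide_depth_map))
```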
4. Experimental Settings
This section introduces our experimental settings. The selected benchmark datasets for performance evaluation are first given. The evaluation metrics and training details are then presented.
We conduct the experiments on three benchmark datasets: the stereo hand tracking benchmark (STB) [35], the rendered hand pose dataset (RHD) [38], and the multi-view hand pose (MHP) dataset [12].

Figure 6. Comparisons with the state-of-the-art approaches on the (a) STB, (b) RHD, and (c) MHP datasets for 3D hand pose estimation.

The STB dataset is a dataset of real hands. It contains two subsets called SK and BB. The images in SK are captured by a Point Grey Bumblebee2 stereo camera, while the images in BB are from a depth sensor. In our experiments, we use the BB subset for DGGAN training, and leverage the SK subset for unpaired testing.

RHD is a synthetic dataset. Zimmermann et al. [38] use a 3D simulator, Maya, to render the images from 20 different characters performing 39 actions. Each data entry consists of an RGB image, the corresponding depth image, and both 2D and 3D annotations. This dataset is challenging since its images are captured from various viewpoints and cover many different hand shapes.

The MHP dataset provides color hand images as well as the bounding boxes of hands and the 2D and 3D locations of each joint. It consists of hand images of people performing different hand movements. For each frame, it provides the images from four different angles of view. The 2D and 3D annotations are obtained with a Leap Motion controller.

Before training, we first crop the hand regions from the original canvas to make sure that hand parts occupy a dominating proportion of the frame. Notice that the STB and MHP datasets use the center of the palm rather than the wrist as one of the hand keypoints. Hence, we revise the annotation to move the keypoint from the center of the palm to the wrist in the same way as performed in [4].
Following previous works [4, 6, 38], we evaluate the results of hand pose estimation by using 1) the area under the curve (AUC) on the percentage of correct keypoints (PCK) between the thresholds of 20 mm and 50 mm (AUC 20-50), and 2) the end-point error (EPE): the distance between the predicted 3D joint locations and the ground truth. In Table 1, we report the AUC 20-50 as well as the mean and the median of the EPE over all hand keypoints.
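For reference, a sketch of the two metrics assuming NumPy; joint coordinates are taken to be in millimeters, and the 100-point threshold grid is an arbitrary discretization choice rather than a protocol detail.

```python
import numpy as np

def epe(pred_joints, gt_joints):
    """End-point error: per-joint Euclidean distance in mm.
    pred_joints, gt_joints: arrays of shape (num_samples, K, 3)."""
    return np.linalg.norm(pred_joints - gt_joints, axis=-1)

def auc_pck(pred_joints, gt_joints, lo=20.0, hi=50.0, steps=100):
    """Area under the PCK curve between the lo and hi thresholds (mm)."""
    errors = epe(pred_joints, gt_joints).ravel()
    thresholds = np.linspace(lo, hi, steps)
    pck = np.array([(errors <= t).mean() for t in thresholds])
    return np.trapz(pck, thresholds) / (hi - lo)  # normalized to [0, 1]
```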
During training, we first initialize the weights of the depth-map reconstruction and hand pose estimation modules in the proposed DGGAN. Both modules are initialized by fitting the STB dataset (see Section 4.1) but trained separately. Then, we connect the two modules and fine-tune the whole network in an end-to-end manner. For training with the RHD and STB datasets, the discriminator is trained to distinguish $G_{S \to T}(x_s)$ from $x_t$, a randomly chosen depth map from the respective dataset. For the MHP dataset, we simply assign a randomly chosen depth map from the RHD dataset as $x_t$, because the MHP dataset does not contain any dense depth maps.
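A sketch of one joint fine-tuning step after initialization, reusing the gan_losses and task_loss helpers sketched above; the pose_net interface and the decision to detach the generated depth map before using it as regularization guidance are our assumptions, not specifics stated in this section.

```python
def joint_finetune_step(generator, discriminator, pose_net,
                        rgb, real_depth, gt_heatmaps, gt_z,
                        g_opt, d_opt, p_opt, lambda_t=1.0, lambda_g=1.0):
    """One end-to-end fine-tuning step. `real_depth` is an unpaired depth map
    drawn from the depth-only pool (e.g., RHD depth maps when training on MHP)."""
    d_loss, g_loss, fake_depth = gan_losses(generator, discriminator,
                                            rgb, real_depth)

    # Update the discriminator on real vs. generated depth maps.
    d_opt.zero_grad(); d_loss.backward(); d_opt.step()

    # Pose estimation regularized by the GAN-generated depth map (no GT depth).
    # pose_net is assumed to return (list of stage heatmaps, Z, relative depth map).
    pred_heatmaps, pred_z, pred_depth_map = pose_net(rgb)
    t_loss = task_loss(pred_heatmaps, gt_heatmaps, pred_z, gt_z,
                       pred_depth_map, fake_depth.detach())

    # Joint objective of Eq. (1): task loss plus the generator's adversarial loss.
    total = lambda_t * t_loss + lambda_g * g_loss
    g_opt.zero_grad(); p_opt.zero_grad()
    total.backward()
    g_opt.step(); p_opt.step()
    return total.item()
```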
5. Experimental Results
For evaluation on the STB dataset, we choose PSO [19], ICPPSO [25], and CHPR [27] as the baselines. In addition, we select the state-of-the-art approaches, Z&B [38] and that by Cai et al. [4], for comparison. On the RHD dataset, we compare our method with Z&B [38] and that in [4]. Also, on the MHP dataset, we compare our method to that in [4]. Note that Cai et al. [4] have not released their code yet. We re-implement their method and report the results according to our implementation.
To analyze the effectiveness of the proposed DGGAN, we conduct ablation studies on the three datasets. The detailed results are summarized in Table 1. Specifically, we conduct experiments under the following three settings:

1. Regression: training the regression network on RGB images only, without any depth regularizer.

2. Regression + DR + DGGAN: training the depth-regularized regression network using RGB images with the depth maps generated by DGGAN.

3. Regression + DR + true depth map: training the depth-regularized regression network using RGB images with their paired true depth maps.

Figure 7. Comparison between the generated and ground-truth depth maps on the RHD dataset. The first and fourth columns show the RGB images. The second and fifth columns display the real depth maps. The third and sixth columns give the generated depth maps.

Figure 8. Comparison between the generated and ground-truth depth maps on the STB dataset. The first and fourth columns show the RGB images. The second and fifth columns display the real depth maps. The third and sixth columns give the generated depth maps.

Table 1. 3D pose estimation results on the RHD, STB, and MHP datasets. ↑: higher is better. ↓: lower is better. Regression is the previous state of the art without using paired depth maps.

                                  AUC 20-50 ↑   EPE mean (mm) ↓   EPE median (mm) ↓
RHD Dataset
  Regression                      0.816         21.5              13.96
  Regression + DR + DGGAN
  Regression + DR + true depth    0.859         18.0              13.16
STB Dataset
  Regression                      0.976         10.91             9.11
  Regression + DR + DGGAN
  Regression + DR + true depth    0.984         10.05             8.44
MHP Dataset
  Regression                      0.928         14.08             10.75
  Regression + DGGAN

Table 2. EPE mean comparison on the STB dataset between our approach and the method by Boukhayma et al. [3].

Method                            EPE mean (mm) ↓
Regression + DR + DGGAN (Ours)
Boukhayma et al. [3]              9.76

To measure the effectiveness of the generated depth maps, we compare the Regression and Regression + DR + DGGAN settings. As illustrated in Table 1, using the generated depth maps significantly boosts the performance of the model in Regression, improving the AUC 20-50 and considerably reducing the EPE mean on the RHD, STB, and MHP datasets.

To compare the generated depth maps with the real depth maps, we conduct two more experiments. Comparing the results of Regression + DR + true depth map and Regression + DR + DGGAN shows that the generated depth maps are a key factor of the performance boost. On the RHD dataset, training with the generated depth maps is only slightly worse than training with the true RHD depth maps in both AUC 20-50 and EPE mean. However, on the STB dataset, training with generated depth maps even outperforms training with the real depth maps in both AUC 20-50 and EPE mean. This result is probably due to the fact that the depth maps collected from depth sensors are less stable and noisier than the depth maps rendered by a 3D simulator. By training the DGGAN with unpaired high-quality depth maps from RHD, our generator can potentially reduce the noise and further benefit the training of the hand pose estimation module. It is worth noting that Regression + DR + true depth map requires paired depth and RGB images.

In addition to the quantitative analysis, Figures 7 and 8 provide examples for visual comparison between the generated and true depth maps on the RHD and STB datasets, respectively. We can see that the generated depth maps are visually very similar to the ground-truth ones.
We select the state-of-the-art approaches [3, 4, 23, 26, 35, 22, 38] for comparison. The comparison results are reported in Figure 6 and Table 2. As shown in Figure 6 and Table 2, our approach outperforms all existing state-of-the-art methods. Although the results of the method by Cai et al. [4] come close to ours, we emphasize that our DGGAN has a crucial advantage: it does not require any paired RGB and depth images.
6. Conclusion
The lack of large-scale datasets of paired RGB and depth images is one of the major bottlenecks for improving 3D hand pose estimation. To address this limitation, we propose a conditional GAN-based model, called DGGAN, to bridge the gap between RGB images and depth maps. DGGAN synthesizes depth maps from RGB images to regularize the 3D hand pose prediction model during training, eliminating the need for the paired RGB images and depth maps conventionally used to train such models. The proposed DGGAN is integrated into a 3D hand pose prediction framework and is trained end-to-end for 3D pose estimation. DGGAN not only generates realistic hand depth images, which can be used in many other applications such as 3D shape estimation, but also results in significant improvement in 3D hand pose estimation, achieving new state-of-the-art results.
Acknowledgement.
This work was supported in part by the Ministry of Science and Technology (MOST) under grants MOST 107-2628-E-001-005-MY3 and MOST 108-2634-F-007-009.
References

[1] S. Antoshchuk, M. Kovalenko, and J. Sieck. Gesture recognition-based human-computer interaction interface for multimedia applications. In Digitisation of Culture: Namibian and International Perspectives, 2018.
[2] S. Baek, K. I. Kim, and T.-K. Kim. Pushing the envelope for RGB-based dense 3D hand pose estimation via neural rendering. In CVPR, 2019.
[3] A. Boukhayma, R. de Bem, and P. H. Torr. 3D hand shape and pose from images in the wild. In CVPR, 2019.
[4] Y. Cai, L. Ge, J. Cai, and J. Yuan. Weakly-supervised 3D hand pose estimation from monocular RGB images. In ECCV, 2018.
[5] Z. Cao, T. Simon, S.-E. Wei, and Y. Sheikh. Realtime multi-person 2D pose estimation using part affinity fields. In CVPR, 2017.
[6] L. Chen, S.-Y. Lin, Y. Xie, H. Tang, Y. Xue, Y.-Y. Lin, X. Xie, and W. Fan. TAGAN: Tonality-alignment generative adversarial networks for realistic hand pose synthesis. In BMVC, 2019.
[7] Y.-C. Chen, Y.-Y. Lin, M.-H. Yang, and J.-B. Huang. CrDoCo: Pixel-level domain transfer with cross-domain consistency. In CVPR, 2019.
[8] X. Deng, S. Yang, Y. Zhang, P. Tan, L. Chang, and H. Wang. Hand3D: Hand pose estimation using 3D neural network. arXiv preprint arXiv:1704.02224, 2017.
[9] L. Ge, Y. Cai, J. Weng, and J. Yuan. Hand PointNet: 3D hand pose estimation using point sets. In CVPR, 2018.
[10] L. Ge, Z. Ren, Y. Li, Z. Xue, Y. Wang, J. Cai, and J. Yuan. 3D hand shape and pose estimation from a single RGB image. In CVPR, 2019.
[11] L. Ge, Z. Ren, and J. Yuan. Point-to-point regression PointNet for 3D hand pose estimation. In ECCV, 2018.
[12] F. Gomez-Donoso, S. Orts-Escolano, and M. Cazorla. Large-scale multiview 3D hand pose dataset. arXiv preprint arXiv:1707.03742, 2017.
[13] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In NIPS, 2014.
[14] J. Hoffman, E. Tzeng, T. Park, J.-Y. Zhu, P. Isola, K. Saenko, A. A. Efros, and T. Darrell. CyCADA: Cycle-consistent adversarial domain adaptation. arXiv preprint arXiv:1711.03213, 2017.
[15] Y.-P. Hung and S.-Y. Lin. Re-anchorable virtual panel in three-dimensional space, 2016. US Patent 9,529,446.
[16] U. Iqbal, P. Molchanov, T. Breuel, J. Gall, and J. Kautz. Hand pose estimation via latent 2.5D heatmap regression. In ECCV, 2018.
[17] P. Isola, J.-Y. Zhu, T. Zhou, and A. A. Efros. Image-to-image translation with conditional adversarial networks. In CVPR, 2017.
[18] H. Joo, T. Simon, and Y. Sheikh. Total capture: A 3D deformation model for tracking faces, hands, and bodies. In CVPR, 2018.
[19] J. Kennedy. Particle swarm optimization. Encyclopedia of Machine Learning, 2010.
[20] S. Li and D. Lee. Point-to-pose voting based hand pose estimation using residual permutation equivariant layer. arXiv preprint arXiv:1812.02050, 2018.
[21] S.-Y. Lin, C.-K. Shie, S.-C. Chen, and Y.-P. Hung. AirTouch panel: A re-anchorable virtual touch panel. In MM, 2013.
[22] F. Mueller, F. Bernard, O. Sotnychenko, D. Mehta, S. Sridhar, D. Casas, and C. Theobalt. GANerated hands for real-time 3D hand tracking from monocular RGB. In CVPR, 2018.
[23] P. Panteleris, I. Oikonomidis, and A. Argyros. Using a single RGB frame for real time 3D hand pose estimation in the wild. In WACV, 2018.
[24] C. R. Qi, H. Su, K. Mo, and L. J. Guibas. PointNet: Deep learning on point sets for 3D classification and segmentation. In CVPR, 2017.
[25] C. Qian, X. Sun, Y. Wei, X. Tang, and J. Sun. Realtime and robust hand tracking from depth. In CVPR, 2014.
[26] A. Spurr, J. Song, S. Park, and O. Hilliges. Cross-modal deep variational hand pose estimation. In CVPR, 2018.
[27] X. Sun, Y. Wei, S. Liang, X. Tang, and J. Sun. Cascaded hand pose regression. In CVPR, 2015.
[28] B. Tekin, F. Bogo, and M. Pollefeys. H+O: Unified egocentric recognition of 3D hand-object poses and interactions. In CVPR, 2019.
[29] C. Wan, T. Probst, L. Van Gool, and A. Yao. Dense 3D regression for hand pose estimation. In CVPR, 2018.
[30] S.-E. Wei, V. Ramakrishna, T. Kanade, and Y. Sheikh. Convolutional pose machines. In CVPR, 2016.
[31] X. Wu, D. Finnegan, E. O'Neill, and Y.-L. Yang. HandMap: Robust hand pose estimation via intermediate dense guidance map supervision. In ECCV, 2018.
[32] L. Yang and A. Yao. Disentangling latent hands for image synthesis and pose estimation. In CVPR, 2019.
[33] S. Yuan, G. Garcia-Hernando, B. Stenger, G. Moon, J. Yong Chang, K. Mu Lee, P. Molchanov, J. Kautz, S. Honari, L. Ge, J. Yuan, X. Chen, G. Wang, F. Yang, K. Akiyama, Y. Wu, Q. Wan, M. Madadi, S. Escalera, S. Li, D. Lee, I. Oikonomidis, A. Argyros, and T.-K. Kim. Depth-based 3D hand pose estimation: From current achievements to future goals. In CVPR, 2018.
[34] Z. Zafrulla, H. Brashear, T. Starner, H. Hamilton, and P. Presti. American sign language recognition with the Kinect. In ICMI, 2011.
[35] J. Zhang, J. Jiao, M. Chen, L. Qu, X. Xu, and Q. Yang. 3D hand pose tracking and estimation using stereo matching. arXiv preprint arXiv:1610.07214, 2016.
[36] W. Zhe, C. Liyan, R. Shaurya, S. Daeyun, and F. Charless. Geometric pose affordance: 3D human pose with scene constraints. arXiv preprint, 2019.
[37] J.-Y. Zhu, T. Park, P. Isola, and A. A. Efros. Unpaired image-to-image translation using cycle-consistent adversarial networks. In ICCV, 2017.
[38] C. Zimmermann and T. Brox. Learning to estimate 3D hand pose from single RGB images. In ICCV, 2017.