Temporal-Aware Self-Supervised Learning for 3D Hand Pose and Mesh Estimation in Videos
Liangjian Chen¹, Shih-Yao Lin², Yusheng Xie³*, Yen-Yu Lin⁴, and Xiaohui Xie¹

¹University of California, Irvine   ²Tencent America   ³Amazon   ⁴National Chiao Tung University
{liangjc2,xhx}@ics.uci.edu, [email protected], [email protected], [email protected]

*Work done outside of Amazon.

Abstract
Estimating 3D hand pose directly from RGB images is challenging but has seen steady progress recently through training deep models with annotated 3D poses. However, annotating 3D poses is difficult, and as such only a few 3D hand pose datasets are available, all with limited sample sizes. In this study, we propose a new framework for training 3D pose estimation models from RGB images without using explicit 3D annotations, i.e., trained with only 2D information. Our framework is motivated by two observations: 1) videos provide richer information for estimating 3D poses than static images; 2) estimated 3D poses ought to be consistent whether the videos are viewed in the forward or reverse order. We leverage these two observations to develop a self-supervised learning model called temporal-aware self-supervised network (TASSN). By enforcing temporal consistency constraints, TASSN learns 3D hand poses and meshes from videos with only 2D keypoint position annotations. Experiments show that our model achieves surprisingly good results, with 3D estimation accuracy on par with state-of-the-art models trained with 3D annotations, highlighting the benefit of temporal consistency in constraining 3D prediction models.
1. Introduction
3D hand estimation is an important research topic in computer vision due to a wide range of potential applications, such as sign language translation [45], robotics [1], movement disorder detection and monitoring, and human-computer interaction (HCI) [29, 18, 28].

Depth sensors and RGB cameras are popular devices for collecting hand data. However, depth sensors are not as widely available as RGB cameras and are much more expensive, which has limited the applicability of hand pose estimation methods developed upon depth images. Recent research interest has shifted toward estimating 3D hand poses directly from RGB images by utilizing the color, texture,
and shape information contained in RGB images. Some methods carry out 3D hand pose estimation from monocular RGB images [5, 20, 51]. More recently, progress has been made on estimating 3D hand shape and mesh from RGB images [2, 3, 15, 47, 42, 7, 26, 25, 50, 49, 48]. Compared to poses, hand meshes provide richer information required by many immersive VR and AR applications. Despite these advances, 3D hand pose estimation remains a challenging problem due to the lack of accurate, large-scale 3D pose annotations.

Figure 1. Motivation and idea: (a) Training a robust 3D hand pose estimator from RGB images relies on plenty of images with 3D hand pose annotations, but obtaining 3D annotations from 2D images is quite difficult; (b) We leverage bi-directional temporal consistency in videos and enable hand pose estimators to make more plausible predictions. It turns out that the hand pose estimator can be derived in a self-supervised fashion without using 3D annotations.

In this work, we develop a new approach to 3D hand pose and mesh estimation by taking the following two observations into account. First, most existing methods rely on training data with 3D information, but capturing 3D information from 2D images is intrinsically difficult. Although a few datasets provide annotated 3D hand joints, the amount is too small to train a robust hand pose estimator. Second, most studies focus on hand pose estimation from a single image. Nevertheless, important applications based on 3D hand poses, such as augmented reality (AR), virtual reality (VR), and sign language recognition, are usually carried out in videos.

According to these two observations, our approach exploits video temporal consistency to address the uncertainty caused by the lack of 3D joint annotations on training data. Specifically, our approach, called temporal-aware self-supervised network (TASSN), can learn and infer 3D hand poses without using annotated 3D training data. Figure 1 shows the motivation and core idea of the proposed TASSN. TASSN explores video information by embedding a temporal structure to extract spatial-temporal features. We design a novel temporal self-consistency loss, which helps train the hand pose estimator without requiring annotated 3D training data. In addition to poses, we estimate hand meshes, since meshes provide salient evidence for pose inference. With meshes, we can infer silhouettes to further regularize our model. The main contributions of this work are as follows:

1. We develop a temporal consistency loss and a reversed temporal information technique for extracting spatio-temporal features. To the best of our knowledge, this work makes the first attempt to estimate 3D hand poses and meshes without using 3D annotations.

2. An end-to-end trainable framework, named temporal-aware self-supervised network (TASSN), is proposed to learn an estimator without using annotated 3D training data. The learned estimator can jointly infer 3D hand poses and meshes from video.

3. Our model achieves high accuracy, with 3D prediction performance on par with state-of-the-art models trained with 3D ground truth.
2. Related Work
Since depth images contain surface geometry information of hands, they are widely used for hand pose estimation in the literature [40, 44, 11, 41, 14, 16, 27, 8, 9]. Most existing work adopts regression to fit the parameters of a deformable hand model [30, 22, 24, 40]. Recent work [14, 16] extracts depth image features and regresses the joints through PointNet [34]. Wu et al. [41] leverage the depth image as intermediate guidance and conduct an end-to-end training framework. Despite their effectiveness, the aforementioned methods rely heavily on accurate depth maps and are less practical in daily life, since depth sensors are often unavailable due to their high cost.
Owing to the wide accessibility of RGB cameras, estimating 3D hand poses from monocular images has become an active research topic [5, 20, 31, 38, 43, 51], and significant improvement has been witnessed. These methods use convolutional neural networks (CNNs) to extract features from RGB images. Zimmermann and Brox [51] feed these features to a 3D lifting network and a camera parameter estimation network for depth regression. Building on Zimmermann and Brox's work, Iqbal et al. [20] add depth maps as intermediate guidance, while Cai et al. [5] propose a weakly supervised approach that utilizes depth maps for regularization. However, these methods suffer from limited training data since 3D hand annotations are hard to acquire. Also, they all discard the temporal information.
3D hand mesh estimation is an active research topic [15, 3, 2, 21, 47]. Methods in [3, 2, 47] estimate hand meshes by using a pre-defined hand model, named MANO [35]. Due to the high degrees of freedom of hand gestures, hand meshes lie in a high-dimensional space. The MANO model serves as a kinematic and shape prior of meshes and can help reduce the dimension. However, since MANO is a linear model, it is not able to capture the nonlinear transformations of hand meshes [15]. Thus, mesh estimators based on MANO suffer from this issue. On the other hand, Ge et al. [15] regress 3D mesh vertices through a graph convolutional neural network (GCN) with down-sampling and up-sampling. Their work achieves state-of-the-art performance, but it is trained on a dataset with 3D mesh ground truth, which is even more difficult to label than 3D joint annotations. This drawback limits its applicability in practice.
Self-supervised learning [12, 33, 13] is a type of training methodology where training data are automatically labeled by exploiting existing information within the data. With this training scheme, manual annotations are not required for a given training set. This scheme is especially beneficial when data labeling is difficult or the data size is exceedingly large. Self-supervised learning has been applied to hand pose estimation. Similar to ours, the method in [13] adopts temporal cycle consistency for self-supervised learning. However, this method uses soft nearest neighbors to solve the video alignment problem, which is not applicable to 3D pose and mesh estimation. Simon et al. [37] adopt multi-view supervisory signals to regress 3D hand joint locations. While their approach resolves the hand self-occlusion issue using multi-view images, its training stage requires 3D joint annotations, which are difficult and expensive to obtain in this task. Another attempt at using self-supervised learning for hand pose estimation is presented in [39], where an approach leveraging a massive amount of unlabeled depth images is proposed. However, this approach may be limited by the high variations of depth maps across diverse poses, scales, and sensing devices. Instead of leveraging multi-view consistency or depth consistency, the proposed self-supervised scheme relies on temporal consistency, which is inexpensive to obtain and does not require 3D keypoint annotations.

Figure 2. Overview of the proposed TASSN. TASSN involves both forward and backward inference to utilize temporal information. Namely, the hand poses estimated by forward and backward inference should be consistent. Our hand pose estimator leverages this observation and can be trained by self-supervised learning without the need for 3D hand joint labels. Moreover, with the constraints of temporal consistency, either forward or backward inference can attain more accurate hand pose estimation results.
3. Proposed Method
We aim to train a 3D hand pose estimator from videos without 3D hand joint labels. To tackle the absence of 3D annotations, we exploit the temporal information in hand motion videos to address the ambiguity caused by the lack of 3D joint ground truth. Specifically, we present a novel deep neural network, named temporal-aware self-supervised network (TASSN). By developing a temporal consistency loss on the hand gestures estimated in a video, TASSN can learn and infer 3D hand poses through self-supervised learning without using any 3D annotations.
Given an RGB hand motion video $x$ with $N$ frames, $x = \{I_1, \ldots, I_N\}$, we aim at estimating the 3D hand poses in this video, where $I_t \in \mathbb{R}^{3 \times W \times H}$ is the $t$-th frame, and $W$ and $H$ are the frame width and height, respectively. The 3D hand pose at frame $t$, $p_t \in \mathbb{R}^{3 \times K}$, is represented by a set of $K$ 3D keypoint coordinates of the hand. Figure 2 illustrates the network architecture of TASSN.

Leveraging the temporal consistency properties of videos, the hand poses and meshes predicted in the forward and backward inference orders can perform mutual supervision. Our model can be fine-tuned on any target dataset using this self-supervised learning, and the temporal consistency is a good substitute for the hard-to-get 3D ground truth. TASSN alleviates the burden of annotating 3D ground truth of a dataset without significantly sacrificing model performance.

Recent studies [15, 47] show that training pose estimators with hand meshes improves the performance, because hand meshes can act as intermediate guidance for hand pose prediction. To this end, we propose a hand pose and mesh estimation (PME) module, which jointly estimates the 2D hand keypoint heatmaps, 3D hand poses, and meshes from every two adjacent frames $I_i$ and $I_{i+1}$.

Figure 3. Network architecture of the pose and mesh estimation (PME) module. The PME module consists of four sub-modules: the flow, 2D keypoint heatmap, 3D hand mesh, and 3D hand pose estimators. The flow estimator computes the optical flow $o_{t+1}$ from two consecutive frames $I_t$ and $I_{t+1}$. With $I_{t+1}$, $o_{t+1}$, and $H_t$, the 2D heatmap estimator computes the keypoint heatmap $H_{t+1}$ at timestamp $t+1$ and extracts the image features. Based on the extracted image features, the 3D hand pose and mesh estimators predict the 3D hand pose $p_{t+1}$ and mesh $m_{t+1}$ at timestamp $t+1$. Two loss terms, the heatmap loss $\mathcal{L}_h$ and the hand mesh loss $\mathcal{L}_m$, are used for optimization.

The proposed PME module consists of four estimator sub-modules: a flow estimator, a 2D keypoint heatmap estimator, a 3D hand mesh estimator, and a 3D hand pose estimator. Given two consecutive frames as input, it estimates the 3D hand pose and mesh. Figure 3 shows its network architecture; a structural sketch of this composition is given below.
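To make the data flow of the PME module concrete, the following is a minimal structural sketch in PyTorch. The sub-module classes and their interfaces (e.g., a heatmap network returning both heatmaps and features) are hypothetical stand-ins for the components described above, not the authors' released code.

```python
# A structural sketch of the PME module's forward pass (PyTorch). The four
# sub-modules are assumed to be provided; their interfaces are hypothetical.
import torch
import torch.nn as nn

class PME(nn.Module):
    """Pose and Mesh Estimation: maps two adjacent frames to outputs at t+1."""
    def __init__(self, flow_net, heatmap_net, mesh_gcn, pose_gcn):
        super().__init__()
        self.flow_net = flow_net        # optical flow between I_t and I_{t+1}
        self.heatmap_net = heatmap_net  # stacked hourglass: heatmaps + features
        self.mesh_gcn = mesh_gcn        # graph CNN: image features -> mesh vertices
        self.pose_gcn = pose_gcn        # stacked GCNs + FC layers: mesh -> 3D pose

    def forward(self, frame_t, frame_t1, heatmap_t):
        # frame_*: (B, 3, H, W); heatmap_t: (B, K, H, W)
        flow = self.flow_net(frame_t, frame_t1)            # o_{t+1}
        x = torch.cat([frame_t1, flow, heatmap_t], dim=1)  # channel-wise concat
        heatmap_t1, features = self.heatmap_net(x)         # H_{t+1} and F
        mesh_t1 = self.mesh_gcn(features)                  # m_{t+1}
        pose_t1 = self.pose_gcn(mesh_t1)                   # p_{t+1}
        return heatmap_t1, mesh_t1, pose_t1
```

Reversing the order of the two input frames to this module is what later enables the backward-inference pass.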
Flow Estimator: To capture temporal cues from a hand gesture video, we adopt FlowNet [19] to estimate the optical flow $o_{t+1} \in \mathbb{R}^{2 \times W \times H}$ between two consecutive frames $I_t$ and $I_{t+1}$. In forward inference, FlowNet computes $o_{t+1}$, the motion from frame $I_t$ to frame $I_{t+1}$. In backward inference, FlowNet computes the reverse motion.

Heatmap Estimator: Our heatmap estimator computes 2D hand keypoints and generates the features for the 3D hand pose and mesh estimators. The estimated 2D keypoint heatmaps are denoted by $H \in \mathbb{R}^{K \times W \times H}$, where $K$ is the number of keypoints. We adopt a stacked hourglass network with two stacks [32] to infer the hand keypoint heatmaps $H$ and compute the features $F$. We concatenate $I_{t+1}$, $o_{t+1}$, and $H_t$ as input to the stacked hourglass network, which produces the heatmaps $H_{t+1}$, as shown in Figure 3. The estimated $H_{t+1}$ comprises $K$ heatmaps $\{H_{t+1}^k \in \mathbb{R}^{W \times H}\}_{k=1}^{K}$, where $H_{t+1}^k$ expresses the confidence map of the location of the $k$-th keypoint. The ground-truth heatmap $\bar{H}_{t+1}^k$ is the Gaussian blur of the Dirac-$\delta$ distribution centered at the ground-truth location of the $k$-th keypoint. The heatmap loss $\mathcal{L}_h$ at frame $t$ is defined by

$$\mathcal{L}_h = \frac{1}{K} \sum_{k=1}^{K} \| H_t^k - \bar{H}_t^k \|_F. \quad (1)$$
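As a concrete illustration of Eq. (1), the sketch below renders Gaussian ground-truth heatmaps and evaluates the per-keypoint Frobenius-norm loss in PyTorch. The function names and the Gaussian width `sigma` are illustrative choices, not values taken from the paper.

```python
# Sketch of the heatmap loss in Eq. (1). Ground-truth maps are Gaussians
# centered at the annotated 2D keypoints; sigma is an illustrative value.
import torch

def gaussian_heatmaps(keypoints_2d, height, width, sigma=2.0):
    """keypoints_2d: (K, 2) pixel coordinates -> (K, H, W) Gaussian maps."""
    ys = torch.arange(height, dtype=torch.float32).view(1, height, 1)
    xs = torch.arange(width, dtype=torch.float32).view(1, 1, width)
    cx = keypoints_2d[:, 0].view(-1, 1, 1)
    cy = keypoints_2d[:, 1].view(-1, 1, 1)
    return torch.exp(-((xs - cx) ** 2 + (ys - cy) ** 2) / (2 * sigma ** 2))

def heatmap_loss(pred, target):
    """Eq. (1): mean Frobenius norm over the K keypoint heatmaps (K, H, W)."""
    return torch.linalg.matrix_norm(pred - target).mean()
```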
3D Hand Mesh Estimator: Our 3D hand mesh estimator is developed based on the Chebyshev spectral graph convolution network (GCN) [15]; it takes the hand features $F$ as input and infers the 3D hand mesh. The output hand mesh $m_t \in \mathbb{R}^{3 \times C}$ is represented by a set of 3D mesh vertices, where $C$ is the number of vertices in a hand mesh.

To model the hand mesh, we use an undirected graph $\mathcal{G}(\mathcal{V}, \mathcal{E})$, where $\mathcal{V}$ and $\mathcal{E}$ are the vertex and edge sets, respectively. The edge set $\mathcal{E}$ can be represented by an adjacency matrix $A$, where $A_{i,j} = 1$ if edge $e(i,j) \in \mathcal{E}$, and $A_{i,j} = 0$ otherwise. The normalized Laplacian matrix of $\mathcal{G}$ is obtained via $L = I - D^{-1/2} A D^{-1/2}$, where $D$ is the degree matrix and $I$ is the identity matrix. Since $L$ is a positive semi-definite matrix [4], it can be decomposed as $L = U \Lambda U^T$, where $\Lambda = \mathrm{diag}(\lambda_1, \lambda_2, \ldots, \lambda_C)$ and $C$ is the number of vertices in $\mathcal{G}$.

We follow the setting in [10] and set the convolution kernel to $\hat{\Lambda} = \mathrm{diag}(\sum_{i=0}^{S} \alpha_i \lambda_1^i, \ldots, \sum_{i=0}^{S} \alpha_i \lambda_C^i)$, where the $\alpha_i$ are the kernel parameters. The convolution operation on $\mathcal{G}$ can be calculated by $F' = U \hat{\Lambda} U^T F \theta_i = \sum_{i=0}^{S} \alpha_i L^i F \theta_i$, where $F \in \mathbb{R}^{N \times F_{in}}$ and $F' \in \mathbb{R}^{N \times F_{out}}$ denote the input and output features, respectively, $S$ is a preset hyperparameter used to control the receptive field, and $\theta_i \in \mathbb{R}^{F_{in} \times F_{out}}$ is a trainable parameter matrix used to control the number of output channels.

The Chebyshev polynomial is used to reduce the model complexity by approximating the convolution operation, leading to the output features $F' = \sum_{i=0}^{S} \alpha_i T_i(\hat{L}) F \theta_i$, where $T_k(x)$ is the $k$-th Chebyshev polynomial and $\hat{L} = 2L/\lambda_{\max} - I$ is the scaled Laplacian used to normalize the input features. A minimal sketch of this operation is given at the end of this subsection.

Figure 4. Examples of our estimated silhouettes. The first and third rows show training images from the STB and MHP datasets, respectively. The second and fourth rows show the silhouettes estimated by our method.

We adopt the scheme in [10, 15] to construct the hand mesh in a coarse-to-fine manner. We use a multi-level clustering algorithm to coarsen the graph, and then store the graph at each level and the mapping between graph nodes in every two consecutive levels. In forward inference, the GCN first up-samples the node features according to the stored mappings and graphs, and then performs the graph convolution operations.

Mesh Silhouette Constraint: Without 3D mesh ground truth, the model tends to collapse to any kind of mesh as long as it is temporally consistent. To avoid this issue, we introduce the mesh loss $\mathcal{L}_m$, which measures the difference between the silhouette $s_t$ of the predicted hand mesh and the ground-truth silhouette $\bar{s}_t$ at frame $t$. The silhouette loss is defined by

$$\mathcal{L}_m = \| s_t - \bar{s}_t \|_F. \quad (2)$$

To obtain $\bar{s}_t$, we use GrabCut [36] to estimate the hand silhouettes from the training images. Some silhouettes estimated from training images are shown in Figure 4. The silhouette $s_t$ of our predicted hand mesh is obtained by the neural rendering approach in [23].
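The following is a minimal sketch of a Chebyshev spectral graph convolution layer in the spirit of Defferrard et al. [10], using the standard recurrence $T_0(x) = 1$, $T_1(x) = x$, $T_k(x) = 2x\,T_{k-1}(x) - T_{k-2}(x)$. The class and parameter names are illustrative; the paper's actual mesh estimator additionally interleaves such layers with graph up-sampling in its coarse-to-fine scheme.

```python
# A minimal Chebyshev graph convolution (PyTorch). The scaled Laplacian
# L_hat = 2L/lambda_max - I is assumed precomputed for the mesh graph.
import torch
import torch.nn as nn

class ChebConv(nn.Module):
    def __init__(self, in_channels, out_channels, num_orders):
        super().__init__()
        # One weight matrix theta_i per Chebyshev order (S + 1 terms).
        self.theta = nn.Parameter(
            torch.randn(num_orders, in_channels, out_channels) * 0.01)

    def forward(self, x, lap_scaled):
        # x: (C, F_in) vertex features; lap_scaled: (C, C) scaled Laplacian.
        Tx_prev, Tx = x, None
        out = Tx_prev @ self.theta[0]                # T_0(L_hat) x = x
        if self.theta.shape[0] > 1:
            Tx = lap_scaled @ x                      # T_1(L_hat) x = L_hat x
            out = out + Tx @ self.theta[1]
        for k in range(2, self.theta.shape[0]):
            Tx_next = 2 * lap_scaled @ Tx - Tx_prev  # Chebyshev recurrence
            out = out + Tx_next @ self.theta[k]
            Tx_prev, Tx = Tx, Tx_next
        return out                                   # (C, F_out)
```

The recurrence avoids the explicit eigendecomposition $L = U \Lambda U^T$, which is the point of the Chebyshev approximation.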
3D Hand Pose Estimator: The proposed 3D pose estimator directly infers the 3D hand keypoints $p_t$ from the predicted hand mesh $m_t$. Taking the mesh as input, we adopt a network of two stacked GCNs, which has a similar structure to that used in the 3D hand mesh estimator. We add a pooling layer to each GCN to extract pose features from the mesh. These pose features are then fed to two fully connected layers to regress the 3D hand pose $p_t$.

Due to the lack of 3D keypoint annotations, conventional supervised learning schemes no longer work for model training. We propose a temporal consistency loss $\mathcal{L}_c$ to solve this problem. Figure 2 shows the idea of our approach. Given a video clip with $n$ frames, we feed every two adjacent frames $\{I_i, I_{i+1}\}_{i=t}^{t+n}$ to the PME module for hand mesh and pose estimation, i.e., $\{p_i, m_i\}_{i=t}^{t+n}$. TASSN analyzes the temporal information according to the relative input order. Thus, we can reverse the input order from $\{I_i, I_{i+1}\}$ to $\{I_{i+1}, I_i\}$ to infer the pose and mesh in $I_i$ from $I_{i+1}$. With this reversed temporal measurement (RTM) technique, we can infer the hand pose and mesh in the reversed temporal order. We denote the pose and mesh estimated in the reversed order as $\{\tilde{p}_i, \tilde{m}_i\}_{i=t}^{t+n}$. As shown in Figure 2, the predictions made by the PME module in forward and backward inference must be consistent with each other, since the same mesh and pose are estimated at any frame. The temporal consistency losses on hand pose, $\mathcal{L}_{pc}$, and hand mesh, $\mathcal{L}_{mc}$, are computed by

$$\mathcal{L}_{pc} = \frac{1}{n} \sum_{i=t}^{t+n} \| p_i - \tilde{p}_i \|_F, \quad (3)$$

$$\mathcal{L}_{mc} = \frac{1}{n} \sum_{i=t}^{t+n} \| m_i - \tilde{m}_i \|_F. \quad (4)$$

The temporal consistency loss $\mathcal{L}_c$ is defined as the weighted sum of $\mathcal{L}_{mc}$ and $\mathcal{L}_{pc}$:

$$\mathcal{L}_c = \lambda_m \mathcal{L}_{mc} + \lambda_p \mathcal{L}_{pc}, \quad (5)$$

where $\lambda_m$ and $\lambda_p$ are the weights of the corresponding losses.

Suppose we are given an unlabeled hand pose dataset $X$ for training, which contains $M$ hand gesture videos, $X = \{x^{(i)}\}_{i=1}^{M}$, where each video $x^{(i)} = \{I_1, \ldots, I_N\}$ consists of $N$ frames. We divide each training video into several video clips, each with $n$ frames, i.e., $v = \{I_t, I_{t+1}, \ldots, I_{t+n}\}$. With the losses defined in Eq. (1), Eq. (2), and Eq. (5), the objective for training the proposed TASSN is

$$\mathcal{L} = \lambda_s \mathcal{L}_m + \lambda_h \mathcal{L}_h + \mathcal{L}_c, \quad (6)$$

where $\lambda_s$ and $\lambda_h$ denote the weights of the losses $\mathcal{L}_m$ and $\mathcal{L}_h$, respectively. The details of the parameter settings are given in the experiments.
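A minimal sketch of the self-supervision in Eqs. (3)-(5): the predictions from a forward pass and from a reversed-order (RTM) pass over the same clip are compared directly. Tensor shapes and default weights here are illustrative, not values from the paper.

```python
# Sketch of the temporal consistency losses in Eqs. (3)-(5). The forward and
# backward predictions come from running the PME module on the clip in the
# original and the reversed frame order, respectively.
import torch

def temporal_consistency_loss(poses_fwd, poses_bwd, meshes_fwd, meshes_bwd,
                              lambda_p=1.0, lambda_m=1.0):
    """poses_*: (n, 3, K) keypoints; meshes_*: (n, 3, C) vertices."""
    l_pc = torch.linalg.matrix_norm(poses_fwd - poses_bwd).mean()    # Eq. (3)
    l_mc = torch.linalg.matrix_norm(meshes_fwd - meshes_bwd).mean()  # Eq. (4)
    return lambda_m * l_mc + lambda_p * l_pc                         # Eq. (5)
```

Because both passes come from the same estimator, the gradient flows through both terms of each difference, pulling the two trajectories toward agreement.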
4. Experimental Settings
We evaluate our approach on two hand pose datasets: the Stereo Tracking Benchmark dataset (STB) [46] and the Multi-view 3D Hand Pose dataset (MHP) [17]. Both datasets include real hand video sequences performed by different subjects, and 3D hand keypoint annotations are provided for the hand video sequences.

Figure 5. Some examples of the two hand pose datasets used for evaluation. The first row shows examples of the STB dataset [46] and the second row gives examples of the MHP dataset [17]. Both datasets include real hand video sequences performed by different subjects and have 3D hand keypoint annotations.

For the STB dataset, we adopt its SK subset for training and evaluation. This subset contains hand videos of 1,500 frames each. Following the train-validation split used in [15], we use the first hand video as the validation set and the remaining videos for training.

The MHP dataset includes a collection of hand motion videos. Each video provides hand color images and several kinds of annotations for each sample, including the bounding box and the 2D and 3D locations of the hand keypoints.

The following data pre-processing scheme is applied to both the STB and MHP datasets. We crop the hand from the original image using the center and scale of the hand, so that the center of the hand is located at the center of the cropped image and the cropped image covers the whole hand. We then resize the cropped image to a fixed resolution. As mentioned in [5, 51], the STB and MHP datasets use the palm center as the center of the hand. We use the mechanism introduced in [5] to change the center of the hand from the palm center to the wrist joint.

We follow the setting adopted in previous work [51, 15] and use the average End-Point-Error (EPE) and the Area Under the Curve (AUC) of the Percentage of Correct Keypoints (PCK) between thresholds of 20 mm and 50 mm (AUC$_{20-50}$) as the two metrics. Besides, we adopt the AUC of PCK between thresholds of 0 mm and 50 mm (AUC$_{0-50}$) as the third metric for evaluating 3D hand pose estimation performance. The measuring unit of EPE is millimeter (mm). A sketch of these metrics is given at the end of this section.

We implement our TASSN in PyTorch, and train and evaluate it on a machine with four GeForce GTX 1080Ti GPUs.

Table 1. 3D hand pose estimation results on the STB and MHP datasets. ↑: higher is better; ↓: lower is better. The measuring unit of EPE is millimeter (mm). Columns: AUC$_{20-50}$ ↑, AUC$_{0-50}$ ↑, EPE ↓; rows: TASSN w/o $\mathcal{L}_c$, TASSN w/o $\mathcal{L}_{mc}$, and TASSN, for each of the STB and MHP datasets.

Since end-to-end training of a network with multiple modules from scratch is very difficult, we train our TASSN with a three-stage procedure. In the first stage, we train the heatmap estimator with the loss $\mathcal{L}_h$. In the second stage, the GCN hand mesh estimator is initialized with the pre-trained model provided by [15], and we jointly fine-tune the heatmap and hand mesh estimators with the losses $\mathcal{L}_h$ and $\mathcal{L}_m$ on the target dataset without 3D supervision. In the final stage, we conduct end-to-end training of TASSN and fine-tune the weights of each sub-module; the weights of the heatmap, GCN hand mesh, and 3D pose estimators are all fine-tuned end-to-end. In this stage, we set $\lambda_h = 1$, $\lambda_p = \lambda_m = 10$, and a fractional weight $\lambda_s$ for the silhouette loss.
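For completeness, here is a small sketch of the evaluation metrics described above: average EPE and the AUC of the PCK curve over a configurable threshold range, so that both AUC variants can be computed. Implementation details such as the threshold sampling step are illustrative.

```python
# Sketch of the evaluation metrics: mean End-Point-Error (EPE, mm) and the
# normalized area under the PCK curve between two thresholds (mm).
import numpy as np

def epe(pred, gt):
    """pred, gt: (N, K, 3) keypoints in mm -> mean per-joint Euclidean error."""
    return np.linalg.norm(pred - gt, axis=-1).mean()

def auc_pck(pred, gt, t_min=20.0, t_max=50.0, steps=100):
    """Area under the PCK curve between t_min and t_max, normalized to [0, 1]."""
    errors = np.linalg.norm(pred - gt, axis=-1).ravel()
    thresholds = np.linspace(t_min, t_max, steps)
    pck = [(errors <= t).mean() for t in thresholds]
    return np.trapz(pck, thresholds) / (t_max - t_min)
```

Calling `auc_pck(pred, gt, 20.0, 50.0)` gives AUC$_{20-50}$, and `auc_pck(pred, gt, 0.0, 50.0)` gives AUC$_{0-50}$.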
5. Experimental Results
To study the impact of the proposed temporal consistency constraint, we train and evaluate TASSN under the following three settings: 1) TASSN trained without the temporal consistency loss $\mathcal{L}_c$, i.e., without any temporal consistency constraint; 2) TASSN trained without the temporal consistency loss of hand mesh $\mathcal{L}_{mc}$, i.e., with the temporal 3D pose constraint but not the 3D mesh constraint; and 3) TASSN trained with all the proposed loss functions.

Table 1 shows the evaluation results on the two 3D hand pose estimation tasks under the three settings described above. The PCK curves corresponding to the different settings are shown in Figure 6.

We note two observations from the ablation study. First, the temporal consistency constraint is critical for 3D pose estimation accuracy. This is clearly illustrated by comparing the results between settings 1 and 3. As shown in Figure 6, TASSN trained with the temporal consistency loss $\mathcal{L}_c$ (red curve, setting 3) outperforms TASSN trained without the temporal consistency loss (blue curve, setting 1) by a large margin on both the STB and MHP datasets. The quantitative results in Table 1 show that AUC$_{20-50}$, AUC$_{0-50}$, and EPE are all improved (AUC$_{20-50}$ by 0.232).

Figure 6. Performance in PCK on the (a) STB and (b) MHP datasets. TCL and TMCL denote the losses $\mathcal{L}_c$ and $\mathcal{L}_{mc}$, respectively.

Figure 7. Comparison with the state of the art. Results in AUC on (a) the STB dataset and (b) the MHP dataset.

Second, the temporal mesh consistency constraint further helps, as shown by comparing settings 2 and 3. By using the temporal mesh consistency loss $\mathcal{L}_{mc}$, AUC$_{20-50}$, AUC$_{0-50}$, and EPE all improve on the STB dataset (Table 1). Results on the MHP dataset share the same trend: test AUCs are boosted by including the temporal mesh consistency loss $\mathcal{L}_{mc}$. This indicates that the temporal mesh consistency loss, as an intermediate constraint, facilitates the learning of the 3D hand pose estimator.

In addition to the quantitative analysis, Figures 8 and 9 display some estimated 3D hand poses for visual comparison among these settings on the STB and MHP datasets, respectively. We can see that TASSN, when trained with the temporal consistency loss, produces 3D hand pose estimates highly similar to the ground truth across diverse poses.

It is worth noting that our GCN model is initialized with the model of [15] pretrained on the STB dataset. Our results on STB demonstrate that the temporal consistency is critical to enforce the 3D constraints, without which 3D prediction accuracy drops substantially (Table 1). Moreover, our method generalizes well to other target datasets, e.g., the MHP dataset, where 3D annotations are used in neither model initialization nor training. The pose categories and capturing environments are quite different between the two datasets (Figure 5). The effectiveness of our method on the MHP dataset can therefore only be attributed to the temporal consistency constraint (Figure 6).

The state-of-the-art methods on both the STB and MHP datasets are trained with 3D annotations, while our method is not. Therefore, we take these methods as the upper bound of our method and evaluate the performance gaps between them and ours. For the STB dataset, we select six state-of-the-art methods for comparison: PSO [3], ICPPSO [8], CHPR [46], the method by Iqbal et al. [20], Cai et al. [5], and the approach by Zimmermann and Brox [51]. For the MHP dataset, we select two state-of-the-art methods for comparison: the approach by Cai et al. [5] and the method by Chen et al. [6].
Figure 8. Comparison among three different settings on the STB dataset. Columns 1 and 6 are RGB images. Columns 2 and 7 are the results of TASSN trained without the temporal consistency loss. Columns 3 and 8 are the results of TASSN trained without the temporal mesh consistency loss. Columns 4 and 9 are the results of TASSN. Columns 5 and 10 are the ground truth.

Figure 9. Comparison among three different settings on the MHP dataset. Columns 1 and 6 are RGB images. Columns 2 and 7 are the results of TASSN trained without the temporal consistency loss. Columns 3 and 8 are the results of TASSN trained without the temporal mesh consistency loss. Columns 4 and 9 are the results of TASSN. Columns 5 and 10 are the ground truth.
Figures 7(a) and 7(b) show the comparison results on the STB and MHP datasets, respectively. As expected, TASSN has a performance gap with the current state-of-the-art methods on both datasets due to the lack of 3D annotations. However, the performance gaps are relatively small. On the STB dataset, as shown in Figure 7(a), our method even outperforms some of the methods trained with full 3D annotations.

Altogether, these results illustrate that a 3D pose estimator can be trained without using 3D annotations. Estimating hand pose and mesh from single frames is challenging due to the ambiguities caused by the missing depth information and the high flexibility of joints. These challenges can be partly mitigated by utilizing information from video, in which the pose and mesh are highly constrained by the adjacent frames. Temporal information offers an alternative way of enforcing constraints on 3D models for pose and mesh estimation.
6. Conclusions
We propose a video-based hand pose estimation model, the temporal-aware self-supervised network (TASSN), to learn and infer 3D hand poses and meshes from RGB videos. By leveraging temporal consistency between forward and reverse measurements, TASSN can be trained through self-supervised learning without explicit 3D annotations. The experimental results show that TASSN achieves reasonably good results, with performance comparable to state-of-the-art models trained with 3D ground truth.

The temporal consistency constraint proposed here offers a convenient yet effective mechanism for training 3D pose prediction models. Although we demonstrate the efficacy of the model without using 3D annotations, it can be used in conjunction with direct supervision from a small number of 3D-labeled samples to improve accuracy.
Acknowledgement.
This work was supported in part by the Ministry of Science and Technology (MOST) under grants MOST 107-2628-E-009-007-MY3, MOST 109-2634-F-007-013, and MOST 109-2221-E-009-113-MY3, and by Qualcomm through a Taiwan University Research Collaboration Project.

References

[1] Svitlana Antoshchuk, Mykyta Kovalenko, and Jürgen Sieck. Gesture recognition-based human–computer interaction interface for multimedia applications. In Digitisation of Culture: Namibian and International Perspectives, 2018.
[2] Seungryul Baek, Kwang In Kim, and Tae-Kyun Kim. Pushing the envelope for RGB-based dense 3D hand pose estimation via neural rendering. In CVPR, 2019.
[3] Adnane Boukhayma, Rodrigo de Bem, and Philip HS Torr. 3D hand shape and pose from images in the wild. In CVPR, 2019.
[4] Joan Bruna, Wojciech Zaremba, Arthur Szlam, and Yann LeCun. Spectral networks and locally connected networks on graphs. arXiv preprint arXiv:1312.6203, 2013.
[5] Yujun Cai, Liuhao Ge, Jianfei Cai, and Junsong Yuan. Weakly-supervised 3D hand pose estimation from monocular RGB images. In ECCV, 2018.
[6] Liangjian Chen, Shih-Yao Lin, Yusheng Xie, Yen-Yu Lin, Wei Fan, and Xiaohui Xie. DGGAN: Depth-image guided generative adversarial networks for disentangling RGB and depth images in 3D hand pose estimation. In WACV, 2020.
[7] Liangjian Chen, Shih-Yao Lin, Yusheng Xie, Yen-Yu Lin, and Xiaohui Xie. MVHM: A large-scale multi-view hand mesh benchmark for accurate 3D hand pose estimation. In WACV, 2021.
[8] Liangjian Chen, Shih-Yao Lin, Yusheng Xie, Hui Tang, Yufan Xue, Yen-Yu Lin, Xiaohui Xie, and Wei Fan. TAGAN: Tonality-alignment generative adversarial networks for realistic hand pose synthesis. In BMVC, 2019.
[9] Liangjian Chen, Shih-Yao Lin, Yusheng Xie, Hui Tang, Yufan Xue, Xiaohui Xie, Yen-Yu Lin, and Wei Fan. Generating realistic training images based on tonality-alignment generative adversarial networks for hand pose estimation. arXiv preprint arXiv:1811.09916, 2018.
[10] Michaël Defferrard, Xavier Bresson, and Pierre Vandergheynst. Convolutional neural networks on graphs with fast localized spectral filtering. In NeurIPS, 2016.
[11] Xiaoming Deng, Shuo Yang, Yinda Zhang, Ping Tan, Liang Chang, and Hongan Wang. Hand3D: Hand pose estimation using 3D neural network. arXiv preprint arXiv:1704.02224, 2017.
[12] Carl Doersch, Abhinav Gupta, and Alexei A. Efros. Unsupervised visual representation learning by context prediction. In ICCV, 2015.
[13] Debidatta Dwibedi, Yusuf Aytar, Jonathan Tompson, Pierre Sermanet, and Andrew Zisserman. Temporal cycle-consistency learning. In CVPR, 2019.
[14] Liuhao Ge, Yujun Cai, Junwu Weng, and Junsong Yuan. Hand PointNet: 3D hand pose estimation using point sets. In CVPR, 2018.
[15] Liuhao Ge, Zhou Ren, Yuncheng Li, Zehao Xue, Yingying Wang, Jianfei Cai, and Junsong Yuan. 3D hand shape and pose estimation from a single RGB image. In CVPR, 2019.
[16] Liuhao Ge, Zhou Ren, and Junsong Yuan. Point-to-point regression PointNet for 3D hand pose estimation. In ECCV, 2018.
[17] Francisco Gomez-Donoso, Sergio Orts-Escolano, and Miguel Cazorla. Large-scale multiview 3D hand pose dataset. arXiv preprint arXiv:1707.03742, 2017.
[18] Yi-Ping Hung and Shih-Yao Lin. Re-anchorable virtual panel in three-dimensional space, Dec. 27 2016. US Patent 9,529,446.
[19] Eddy Ilg, Nikolaus Mayer, Tonmoy Saikia, Margret Keuper, Alexey Dosovitskiy, and Thomas Brox. FlowNet 2.0: Evolution of optical flow estimation with deep networks. In CVPR, 2017.
[20] Umar Iqbal, Pavlo Molchanov, Thomas Breuel, Juergen Gall, and Jan Kautz. Hand pose estimation via latent 2.5D heatmap regression. In ECCV, 2018.
[21] Hanbyul Joo, Tomas Simon, and Yaser Sheikh. Total capture: A 3D deformation model for tracking faces, hands, and bodies. In CVPR, 2018.
[22] David Joseph Tan, Thomas Cashman, Jonathan Taylor, Andrew Fitzgibbon, Daniel Tarlow, Sameh Khamis, Shahram Izadi, and Jamie Shotton. Fits like a glove: Rapid and reliable hand shape personalization. In CVPR, 2016.
[23] Hiroharu Kato, Yoshitaka Ushiku, and Tatsuya Harada. Neural 3D mesh renderer. In CVPR, 2018.
[24] Sameh Khamis, Jonathan Taylor, Jamie Shotton, Cem Keskin, Shahram Izadi, and Andrew Fitzgibbon. Learning an efficient model of hand shape variation from depth images. In CVPR, 2015.
[25] Deying Kong, Yifei Chen, Haoyu Ma, Xiangyi Yan, and Xiaohui Xie. Adaptive graphical model network for 2D hand pose estimation. arXiv preprint arXiv:1909.08205, 2019.
[26] Deying Kong, Haoyu Ma, and Xiaohui Xie. SIA-GCN: A spatial information aware graph neural network with 2D convolutions for hand pose estimation. arXiv preprint arXiv:2009.12473, 2020.
[27] Shile Li and Dongheui Lee. Point-to-pose voting based hand pose estimation using residual permutation equivariant layer. arXiv preprint arXiv:1812.02050, 2018.
[28] Shih-Yao Lin, Yun-Chien Lai, Li-Wei Chan, and Yi-Ping Hung. Real-time 3D model-based gesture tracking for multimedia control. In ICPR, 2010.
[29] Shih-Yao Lin, Chuen-Kai Shie, Shen-Chi Chen, and Yi-Ping Hung. AirTouch panel: A re-anchorable virtual touch panel. In MM, 2013.
[30] Alexandros Makris and Antonis Argyros. Model-based 3D hand tracking with on-line hand shape adaptation. In BMVC, 2015.
[31] Franziska Mueller, Florian Bernard, Oleksandr Sotnychenko, Dushyant Mehta, Srinath Sridhar, Dan Casas, and Christian Theobalt. GANerated hands for real-time 3D hand tracking from monocular RGB. In CVPR, 2018.
[32] Alejandro Newell, Kaiyu Yang, and Jia Deng. Stacked hourglass networks for human pose estimation. In ECCV, 2016.
[33] Deepak Pathak, Ross Girshick, Piotr Dollár, Trevor Darrell, and Bharath Hariharan. Learning features by watching objects move. In CVPR, 2017.
[34] Charles R. Qi, Hao Su, Kaichun Mo, and Leonidas J. Guibas. PointNet: Deep learning on point sets for 3D classification and segmentation. In CVPR, 2017.
[35] Javier Romero, Dimitrios Tzionas, and Michael J. Black. Embodied hands: Modeling and capturing hands and bodies together. ACM Transactions on Graphics, 36(6):245, 2017.
[36] Carsten Rother, Vladimir Kolmogorov, and Andrew Blake. GrabCut: Interactive foreground extraction using iterated graph cuts. ACM Transactions on Graphics, 23:309–314, 2004.
[37] Tomas Simon, Hanbyul Joo, Iain Matthews, and Yaser Sheikh. Hand keypoint detection in single images using multiview bootstrapping. In CVPR, 2017.
[38] Bugra Tekin, Federica Bogo, and Marc Pollefeys. H+O: Unified egocentric recognition of 3D hand-object poses and interactions. In CVPR, 2019.
[39] Chengde Wan, Thomas Probst, Luc Van Gool, and Angela Yao. Self-supervised 3D hand pose estimation through training by fitting. In CVPR, 2019.
[40] Chengde Wan, Thomas Probst, Luc Van Gool, and Angela Yao. Dense 3D regression for hand pose estimation. In CVPR, 2018.
[41] Xiaokun Wu, Daniel Finnegan, Eamonn O'Neill, and Yong-Liang Yang. HandMap: Robust hand pose estimation via intermediate dense guidance map supervision. In ECCV, 2018.
[42] Zhenyu Wu, Duc Hoang, Shih-Yao Lin, Yusheng Xie, Liangjian Chen, Yen-Yu Lin, Zhangyang Wang, and Wei Fan. MM-Hand: 3D-aware multi-modal guided hand generative network for 3D hand pose synthesis. In MM, 2020.
[43] Linlin Yang and Angela Yao. Disentangling latent hands for image synthesis and pose estimation. In CVPR, 2019.
[44] Shanxin Yuan, Guillermo Garcia-Hernando, Björn Stenger, Gyeongsik Moon, Ju Yong Chang, Kyoung Mu Lee, Pavlo Molchanov, Jan Kautz, Sina Honari, Liuhao Ge, Junsong Yuan, Xinghao Chen, Guijin Wang, Fan Yang, Kai Akiyama, Yang Wu, Qingfu Wan, Meysam Madadi, Sergio Escalera, Shile Li, Dongheui Lee, Iason Oikonomidis, Antonis Argyros, and Tae-Kyun Kim. Depth-based 3D hand pose estimation: From current achievements to future goals. In CVPR, 2018.
[45] Zahoor Zafrulla, Helene Brashear, Thad Starner, Harley Hamilton, and Peter Presti. American sign language recognition with the Kinect. In ICMI, 2011.
[46] Jiawei Zhang, Jianbo Jiao, Mingliang Chen, Liangqiong Qu, Xiaobin Xu, and Qingxiong Yang. 3D hand pose tracking and estimation using stereo matching. arXiv preprint arXiv:1610.07214, 2016.
[47] Xiong Zhang, Qiang Li, Wenbo Zhang, and Wen Zheng. End-to-end hand mesh recovery from a monocular RGB image. arXiv preprint arXiv:1902.09305, 2019.
[48] Zhengli Zhao, Sameer Singh, Honglak Lee, Zizhao Zhang, Augustus Odena, and Han Zhang. Improved consistency regularization for GANs. arXiv preprint arXiv:2002.04724, 2020.
[49] Zhengli Zhao, Samarth Sinha, Anirudh Goyal, Colin Raffel, and Augustus Odena. Top-k training of GANs: Improving GAN performance by throwing away bad samples. arXiv preprint arXiv:2002.06224, 2020.
[50] Zhengli Zhao, Zizhao Zhang, Ting Chen, Sameer Singh, and Han Zhang. Image augmentations for GAN training. arXiv preprint arXiv:2006.02595, 2020.
[51] Christian Zimmermann and Thomas Brox. Learning to estimate 3D hand pose from single RGB images. In ICCV, 2017.