I2UV-HandNet: Image-to-UV Prediction Network for Accurate and High-fidelity 3D Hand Mesh Modeling
Ping Chen, Dong Yang, Fangyin Wu, Qin Li, Qingpei Xia, Yong Tan
IQIYI Inc. {chenping, yangdong01, wufangying, liqin01, xiaqingpei, tanyong}@qiyi.com
Figure 1: We propose the SR-Affine approach to reconstruct a high-quality 3D hand mesh. It is composed of the UV map reconstruction network AffineNet (left) and the UV map super-resolution reconstruction network SRNet (right). AffineNet performs the 3D reconstruction of the (low-quality) MANO hand, and SRNet performs the 3D reconstruction of the high-quality hand mesh. The RGB values of the UV map are the XYZ coordinates of the 3D points, and 3D mesh reconstruction can be completed by sampling points from the UV map.
Abstract
Under various poses and heavy occlusions, 3D hand model reconstruction from a single monocular RGB image has been a challenging problem in the computer vision field for many years. In this paper, we propose the SR-Affine approach for high-quality 3D hand model reconstruction. First, we propose an encoder-decoder network architecture (AffineNet) for MANO hand reconstruction. Since the MANO hand is not detailed, we further propose SRNet to up-sample point clouds via image super-resolution on the UV map. Extensive experiments demonstrate that our approach is robust and outperforms state-of-the-art methods on standard benchmarks, including the FreiHAND and HO3D datasets.
Introduction
3D hand model reconstruction from a single monocular RGB image has broad practical value in virtual reality (VR)/augmented reality (AR) and gesture recognition. At present, the 3D hand annotations of monocular images are mainly based on synthetic datasets [1-3] or model-based optimization [4, 5]. Most previous state-of-the-art (SOTA) approaches either regress the parameters of the hand model with articulated and non-rigid deformations (MANO) [37] [5-8] or directly regress the 3D point clouds [4, 9]. However, due to factors such as the image background and the scale and position of the hand, the reconstructed mesh often cannot be aligned well with the original image. Compared with UV mapping methods [11, 12], these methods focus on learning sparse point clouds. In addition, the MANO hand has only 778 points/1538 faces, which is insufficient for VR/AR scenarios that require fine models. Recently, 3D reconstruction by learning dense point clouds achieved good results [11, 12] by mapping the 3D mesh of the human body into a UV map (in which the RGB values represent the XYZ coordinates). UV mapping is essentially the process of mapping each triangle on the 3D mesh to the UV plane according to a certain relationship. Through UV mapping, the sparse point-to-point regression problem can be solved as a dense face-to-face regression problem. However, the UV representation is laid out differently from the RGB image, which leads to what we call coordinate ambiguity: a given location in the RGB image may correspond to an entirely different location in the UV map. To address this problem, previous works [11, 12] apply preprocessing algorithms to align RGB images to the UV maps before training, which is not learnable, introduces preprocessing
errors, and is inconvenient at test time. Similarly, we realize 3D hand model reconstruction through UV mapping. Figure 2 shows different UV unfolding forms of the same 3D hand model. To clear up the coordinate ambiguity between the RGB image and the UV map, we propose AffineNet, a novel coarse-to-fine encoder-decoder network shown in Figure 3. AffineNet uses affine connections to align the encoded features with the UV map, and it combines these aligned encoded features with the corresponding decoded features to predict the output. The UV maps predicted by AffineNet at different resolutions correspond to 3D point clouds at different precisions. Moreover, the precision requirement, i.e., the number of 3D points, varies across scenarios. We refer to the process of increasing the number of 3D points as point-cloud up-sampling. Most approaches [4, 9] to nonparametric 3D point-cloud regression are computationally expensive because they rely on complicated deep networks, and they often encounter nonlinear memory-footprint growth as the number of points increases. In contrast, we can sample a massive number of 3D points from the UV map without increasing the time or space complexity (e.g., we can easily sample tens of thousands of points from a 256x256 UV map). We propose a novel UV map-based method called SRNet that realizes high-quality 3D hand reconstruction by converting point-cloud up-sampling into image super-resolution. We register high-quality scanned 3D hand models with low-quality MANO-based hand models and produce their UV maps with the same size and mapping relationship. The UV maps from high-quality 3D hand models differ from those of low-quality MANO-based hand models in their high-frequency components. Therefore, learning point-cloud up-sampling becomes learning image super-resolution from low-quality UV maps to high-quality UV maps. In addition, as far as we know, we are the first to achieve point-cloud up-sampling with MANO: we need only a low-quality MANO-based hand model to achieve high-quality 3D hand model reconstruction.
Figure 2: Different UV unfolding forms (bottom) of the same human hand model (top). The cuts are made along the black lines, and each color in the unfolded UV map corresponds to the same color in the hand image.
In summary, the main contributions of this paper are as follows:
● We propose a novel high-quality 3D hand reconstruction approach called SR-Affine, as shown in Figure 1.
● We introduce a coarse-to-fine module with affine connections to resolve the coordinate ambiguity between the RGB image and the UV map.
● We convert point-cloud up-sampling into image super-resolution learning. This makes point-cloud up-sampling more concise and handles tens of thousands of points without increasing the time or space complexity.
Related work
This section introduces related work on pose estimation, model-based mesh recovery, monocular mesh reconstruction based on graph convolution networks, and other innovative methods.
Pose estimation
Simon et al. [13] proposed a multi-view bootstrapping method to detect hand key points in real time on RGB images and found that the 2D key points can be triangulated to produce 3D points. Other approaches regress 3D poses from 2D key points, such as [14-16, 38]. Zhou et al. [17] achieved a better probabilistic 3D pose model by fusing 2D heat maps, 3D pose projections, and image features. Pavllo et al. [18] used sequences of 2D key points from a pretrained model to predict 3D relative poses and 3D global positions, and constrained unlabeled 2D poses with a reprojection loss. Iskakov et al. [19] performed 3D pose estimation with a learnable triangulation-based method that combines the feature information of 2D key points from multiple views. These pose estimation works merit recognition.
Model-based mesh recovery
Recently, several mesh recovery approaches [5-8, 35] plug parametric models (i.e., MANO and the skinned multi-person linear model (SMPL) [35]) into end-to-end deep learning networks. Boukhayma et al. [6] obtained accurate MANO parameters and camera poses from RGB images and heat maps predicted by OpenPose [10], using as the loss function the reprojection loss between the projected 2D key points and the ground-truth 2D key points. Based on the idea of cascading and the coarse-to-fine strategy, Zhang et al. [7] introduced a module called iterative regression to obtain better results. Kolotouros et al. [8] argued that weakly supervised methods based on predicted 2D key points lead to mediocre image-model alignment and require a large amount of data to train the network properly. As a result, they developed a self-improving loop that uses the network's regressed estimates to initialize an iterative optimization routine and then uses the optimized model parameters to supervise the output of the network. To address the lack of annotated real images and the occlusions between the hand and the object, Hampali et al. [5] captured video sequences with several RGB-D cameras and jointly optimized the 3D hand pose over all the frames simultaneously.
GCN-based mesh recovery
The graph convolution method is effective for processing graph data. As meshes are essentially undirected graphs, graph convolution networks (GCNs) are well suited to such non-Euclidean-structured data [20, 21] and have been widely used in monocular mesh reconstruction in recent years. Wang et al. [22] trained a convolutional network to obtain image features, aligned them with an initial ellipsoid mesh, and applied graph convolutions on the ellipsoid mesh to obtain a target mesh whose number of points may differ from that of the ground-truth mesh. Ge et al. [23] proposed a multi-supervised training strategy: with the supplementary information provided by 2D and 3D key points, they supervised the reconstructed mesh surface and successfully addressed the lack of real mesh data. Kolotouros et al. [24] hypothesized that GCNs have difficulty properly regressing a large number of mesh vertices, so they passed the vertices from the GCN through a parametric model to smooth the output. Kulon et al. [4] applied a spiral selection of vertex neighbors to fully utilize spatial information and achieved good performance on unseen data.
Others
In addition, there are some novel methods for 3D hand model reconstruction. Moon et al. [9] proposed the concept of a "lixel", replacing the commonly used heat map with three 1D lines representing the x, y, and z coordinates of the mesh vertices. Moon et al. [25] used a weakly supervised method on the basis of depth maps to obtain a higher-fidelity reconstructed hand than that obtained with MANO.
AffineNet
UV map generation
First, we use the MAYA software [33] to unfold the MANO hand into a UV map, as shown in Figure 2, and record the UV mapping relationship (i.e., the 2D UV coordinates corresponding to each 3D point), which is used to sample 3D points from the UV maps in subsequent processes. Then, we align each MANO hand model with its corresponding RGB image through the same projection matrix (i.e., an orthographic projection matrix). Finally, we generate the corresponding UV maps according to [12] and the recorded mapping relationship.
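Since the mapping relationship is fixed once recorded, recovering a mesh from a predicted UV map reduces to an indexed lookup. The sketch below illustrates this sampling step; the uv_coords array standing in for the recorded relationship is an assumed layout, not the exact format we store.

```python
import numpy as np

def sample_mesh_from_uv(uv_map, uv_coords):
    """Recover 3D vertices from a UV position map.

    uv_map:    (H, W, 3) array whose RGB values store XYZ coordinates.
    uv_coords: (V, 2) pixel location (u, v) of each mesh vertex on the UV
               plane, recorded when the MANO template was unfolded.
    Returns:   (V, 3) array of 3D vertex positions.
    """
    u, v = uv_coords[:, 0], uv_coords[:, 1]
    return uv_map[v, u, :]  # row index = v, column index = u

# Together with the fixed MANO face list, the sampled vertices yield the
# full triangle mesh.
```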
Affine Network
We propose an end-to-end network (AffineNet) to realize UV map prediction in a coarse-to-fine manner. The encoder network is a 50-layer residual neural network (ResNet-50) [36] in our experiments. As shown in Figure 3, AffineNet extracts RGB image features during encoding and predicts UV maps during decoding. Since there is also a coordinate ambiguity between encoded features and decoded features, we cannot directly connect them. Hence, we introduce an affine connection module, which aligns encoded features to decoded features through an affine operation before connecting them. The affine operation, similar to the STN [26], is based on the 2D projection of each vertex coordinate in the currently predicted UV map. As Formulas (1) and (2) show, eliminating the coordinate ambiguity between encoded and decoded features and predicting the UV maps are therefore complementary, coarse-to-fine processes. In detail:

$$\begin{cases} \hat{E}_i = f_{up}\big(f_{ac}(\pi(I_{UV}^{i+1}),\, E_{i+1})\big) \\ I_{UV}^{i} = f_{uv}\big(\hat{E}_i,\, D_i,\, f_{up}(I_{UV}^{i+1})\big) \end{cases}, \quad i = 0, 1, 2, 3 \qquad (1)$$

and

$$\begin{cases} D_4 = f_{uv}\big(f_{up}(E_5)\big) \\ I_{UV}^{4} = f_{uv}(D_4) \end{cases} \qquad (2)$$

where $E_i$ is the feature encoded at the $i$-th pyramid level, $f_{up}(x)$ is a 2× up-sampling operation, and $\pi(I_{UV})$ projects the 3D coordinates in the UV map onto the 2D plane through the projection matrix mentioned in the UV map generation section. $f_{ac}(x, y)$ denotes the affine connection operation, in which $x$ performs an affine transformation on $y$, and $\hat{E}_i$ is the UV-aligned feature produced by this affine transformation. The aligned encoded features $\hat{E}_i$ contain more hand features than $E_i$, which reduces the influence of the background and makes AffineNet more robust and stable. $D_i$ is the feature decoded at the $i$-th pyramid level, and $f_{uv}(x, y, z)$ indicates that the UV map is generated from $x$, $y$, and $z$. Note that the smaller $i$ is, the greater the resolution. During testing, the input of the network is an RGB image of size 256x256, the output is a UV map of size 256x256, and the 3D model is obtained by sampling the UV map.
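The affine connection can be viewed as an STN-style feature warp driven by the projected UV map. The following PyTorch sketch shows one plausible realization under assumed tensor layouts and an assumed orthographic projection onto pixel coordinates; it illustrates the idea rather than reproducing the exact implementation.

```python
import torch
import torch.nn.functional as F

def affine_connection(uv_map, enc_feat, proj):
    """Sketch of f_ac: re-sample encoder features so they align with UV space.

    uv_map:   (B, 3, H, W) current UV prediction (RGB channels = XYZ coords).
    enc_feat: (B, C, H, W) encoder feature map at the same pyramid level.
    proj:     (B, 2, 3) orthographic projection onto image pixel coordinates.
    """
    B, _, H, W = uv_map.shape
    xyz = uv_map.permute(0, 2, 3, 1).reshape(B, -1, 3)   # (B, H*W, 3)
    xy = torch.bmm(xyz, proj.transpose(1, 2))            # (B, H*W, 2) pixels
    # Normalize pixel coordinates to [-1, 1] as required by grid_sample.
    scale = torch.tensor([W - 1.0, H - 1.0], device=xy.device)
    grid = (2.0 * xy / scale - 1.0).view(B, H, W, 2)
    # Each UV-map pixel fetches the image feature at its projected 2D
    # location, removing the coordinate ambiguity between the two domains.
    return F.grid_sample(enc_feat, grid, align_corners=True)
```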
AffineNet Loss
The AffineNet loss function comprises three terms:

$$L_{affine} = \lambda_1 L_{UV} + \lambda_2 L_{grad} + \lambda_3 L_{sample} \qquad (3)$$

UV loss:
We use the L1 loss to account for UV reconstruction:

$$L_{UV} = \big\| (I_{UV} - I_{UV}^{*}) \odot M \big\|_1 \qquad (4)$$

where $I_{UV}^{*}$ is the ground-truth UV map and $I_{UV}$ is the reconstructed UV map. $M$ is the UV map mask: $M(i, j) = 0$ at pixels that are not mapped from the 3D point cloud, whose RGB values in the UV map are (0, 0, 0).
UV gradient loss:
The UV map can essentially be seen as a mapping of each triangle on the 3D model to a 2D plane without overlapping, so the values on the corresponding triangular faces in the UV map should be continuous. Therefore, we introduce the gradient loss:

$$L_{grad} = \big\| \partial_x(I_{UV} \odot M) - \partial_x(I_{UV}^{*} \odot M) \big\|_1 + \big\| \partial_y(I_{UV} \odot M) - \partial_y(I_{UV}^{*} \odot M) \big\|_1 \qquad (5)$$

where $\partial_x$ and $\partial_y$ are gradients along the x-axis and y-axis, respectively.
Sampling loss:
We use the L1 loss to account for 3D model reconstruction:

$$L_{sample} = \big\| f_{s}(R, I_{UV} \odot M) - f_{s}(R, I_{UV}^{*} \odot M) \big\|_1 \qquad (6)$$

where $f_{s}(x, y)$ samples the 3D coordinates in $y$ depending on the pixel locations in $x$, and $R$ is the 3D-to-2D mapping relationship recorded in the UV map generation stage. During training, $L_{affine}$ is optimized at the four stages with the largest resolutions ($i = 0, 1, 2, 3$), the projection matrix is an orthographic projection matrix, $\lambda_k = 1$ ($k = 1, 2, 3$), and the loss ratio at each stage is 1.

Figure 3: AffineNet network framework diagram. AffineNet is an encoder-decoder network. During decoding, the UV map is generated via affine connections and coarse-to-fine modules.
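For concreteness, the three terms of Formula (3) can be written compactly in PyTorch. The sketch below assumes NCHW position maps, a binary mask M, and the mapping R stored as an integer array of per-vertex pixel locations; λ_k = 1 as above. It is an illustrative rendition, not the training code itself.

```python
import torch

def affine_loss(uv_pred, uv_gt, mask, uv_coords):
    """L_affine of Formula (3) with lambda_k = 1, combining Formulas (4)-(6).

    uv_pred, uv_gt: (B, 3, H, W) predicted / ground-truth UV position maps.
    mask:           (1, 1, H, W) binary UV-map mask M.
    uv_coords:      (V, 2) long tensor of per-vertex UV pixel locations (R).
    """
    p, g = uv_pred * mask, uv_gt * mask
    l_uv = (p - g).abs().mean()                          # UV loss, Formula (4)
    # Finite-difference gradients along x and y, Formula (5).
    dx = (p[..., :, 1:] - p[..., :, :-1]) - (g[..., :, 1:] - g[..., :, :-1])
    dy = (p[..., 1:, :] - p[..., :-1, :]) - (g[..., 1:, :] - g[..., :-1, :])
    l_grad = dx.abs().mean() + dy.abs().mean()
    # Sample per-vertex 3D coordinates at the recorded UV pixels, Formula (6).
    u, v = uv_coords[:, 0], uv_coords[:, 1]
    l_sample = (p[:, :, v, u] - g[:, :, v, u]).abs().mean()
    return l_uv + l_grad + l_sample
```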
SRNet
Data Handling
Following the edge-based unpooling method proposed in [22], we up-sample the original MANO model from 778 points/1538 faces to 3093 points/6152 triangles (1538 valid faces). Then, the iterative closest point (ICP) algorithm is used to register the 3D point clouds collected by a high-quality scanner with the up-sampled 3D point clouds, generating a ground-truth 3D model containing 3093 points/6152 faces. We generate one UV map from the up-sampled 3D model with 1538 faces and another from the ground-truth 3D model with 6152 faces. We call the former the low-quality UV map and the latter the high-quality UV map.
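Edge-based unpooling inserts a new vertex at the midpoint of every edge and splits each triangle into four, which takes MANO's 778 vertices/1538 faces to roughly the 3093 points/6152 faces quoted above. The sketch below illustrates the operation; it follows [22] in spirit and is not the exact code used here.

```python
import numpy as np

def edge_based_unpool(verts, faces):
    """One step of midpoint subdivision: one new vertex per edge, four new
    faces per face.

    verts: (V, 3) float array of vertex positions.
    faces: (F, 3) int array of triangle vertex indices.
    """
    edge_mid = {}                 # undirected edge -> new vertex index
    new_verts = list(verts)

    def midpoint(a, b):
        key = (min(a, b), max(a, b))
        if key not in edge_mid:
            edge_mid[key] = len(new_verts)
            new_verts.append((verts[a] + verts[b]) / 2.0)
        return edge_mid[key]

    new_faces = []
    for a, b, c in faces:
        ab, bc, ca = midpoint(a, b), midpoint(b, c), midpoint(c, a)
        new_faces += [[a, ab, ca], [ab, b, bc], [ca, bc, c], [ab, bc, ca]]
    return np.asarray(new_verts), np.asarray(new_faces)
```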
Super-Resolution Loss
We convert point-cloud up-sampling into image super-resolution via the UV map. The input of SRNet is the low-quality UV map, and the high-quality UV map is the label; up-sampling from 778 to 3093 points is thus transferred to learning image super-resolution. The super-resolution loss is:

$$L_{sr} = \big\| I_{UV} - I_{UV}^{*} \big\|_1 + \big\| f_{s}(R, I_{UV}) - f_{s}(R, I_{UV}^{*}) \big\|_1 \qquad (7)$$

where $f_{s}(x, y)$ and $R$ are as defined in Formula (6). During testing, we feed the output of AffineNet to SRNet to obtain a high-quality 3D model.
Evaluation
In this section, we evaluate the effectiveness of our SR-Affine approach on hand benchmark datasets. Specifically, we present an experimental analysis of AffineNet under various poses and occlusions together with self-comparative experiments, and we experimentally verify the conversion of point-cloud up-sampling into image super-resolution by SRNet.
Datasets
This section mainly introduces the hand benchmark datasets: FreiHAND [27], HO3D [5], ObMan [1], and YouTube-3D-Hands (YT-3D) [4], as well as our SuperHandScan (SHS) dataset.
FreiHAND:
FreiHAND contains real hand data with different poses and varying lighting. It provides 130,240 training samples comprising RGB images, MANO-based 3D hand models, and camera poses. The testing dataset contains only 3,960 RGB images, so results must be uploaded to the FreiHAND Competition [27] for online testing.
HO3D:
HO3D is a recently released dataset that has difficult poses with object interactions. The objects are mainly from the YCB-Video dataset [28]. HO3D contains 66,034 training samples comprising RGB images, MANO parameters, and various camera poses. In the testing dataset, the 11,524 RGB images have only bounding-box annotations, so results must be uploaded to the HO3D Competition [5] for online testing.
ObMan:
ObMan is a synthetic dataset with object interactions that contains 141,550 training samples, 6463 validation samples, and 6285 testing samples.
YouTube-3D-Hands (YT-3D):
The images are from YouTube.com and show a variety of real hands in real-world situations. The training set is generated from 102 videos, resulting in 47,125 hand annotations. The validation and test sets cover 7 videos and contain 1525 samples each.
SuperHandScan (SHS):
We build this high-quality 3D hand dataset using a laser scanner. It has 6,000 samples of 3D point clouds. The high-quality hand models in SHS are used for registration to the MANO hand models. Since SHS lacks RGB images, it can only be used for SRNet training.
Hand mesh modeling via AffineNet
Experimental details and metrics:
We use FreiHAND, ObMan (a synthetic dataset), and YT-3D to train our model in PyTorch [31]. The training of AffineNet is straightforward, without any tricks. AffineNet is trained on 4 GPUs (V100) with 32 images per GPU using the Adam optimizer and a cosine learning rate schedule, with the initial learning rate set to 1e-4. During training, we perform standard augmentations by scaling the images, rotating the images, and randomly permuting the color channels.
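A minimal sketch of this optimization setup is given below; the placeholder model and the exact augmentation ranges are assumptions, as AffineNet itself and our augmentation parameters are not spelled out here.

```python
import torch
import torchvision.transforms as T

# Adam with an initial learning rate of 1e-4 and a cosine schedule.
model = torch.nn.Conv2d(3, 3, 1)            # placeholder for AffineNet
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=100)

# Standard augmentations: scaling, rotation, random channel permutation.
augment = T.Compose([
    T.RandomAffine(degrees=45, scale=(0.8, 1.2)),   # assumed ranges
    T.Lambda(lambda img: img[torch.randperm(3)]),   # permute RGB channels
])

for epoch in range(100):
    # ... iterate over batches (32 images per GPU), compute L_affine,
    # back-propagate, and step the optimizer ...
    scheduler.step()
```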
To evaluate the results fairly, we follow the metrics [2, 27, 32] used in the FreiHAND Competition:
Pose error & PCK:
The average Euclidean distance between the predicted and ground-truth 3D poses.
Mesh error & PCV:
The average Euclidean distance between the predicted and ground-truth meshes.
AUC:
The area under the percentage of correct key points/vertices (PCK/PCV) curve over the interval from 0 cm to 5 cm, computed with 100 equally spaced thresholds.
F-score:
The harmonic mean of the recall and precision between two meshes given a distance threshold.
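As a reference for how the AUC metric above can be computed, the sketch below integrates a PCK curve over 100 equally spaced thresholds between 0 and 5 cm. This reflects our reading of the FreiHAND protocol [27], not code from the competition toolkit.

```python
import numpy as np

def auc_of_pck(pred, gt, max_thresh_cm=5.0, steps=100):
    """AUC of the PCK curve: fraction of key points within each threshold,
    integrated over equally spaced thresholds in [0, max_thresh_cm].

    pred, gt: (N, K, 3) predicted / ground-truth 3D key points in cm.
    """
    dists = np.linalg.norm(pred - gt, axis=-1)               # (N, K) errors
    thresholds = np.linspace(0.0, max_thresh_cm, steps)
    pck = [(dists <= t).mean() for t in thresholds]          # PCK at each t
    return float(np.trapz(pck, thresholds) / max_thresh_cm)  # normalized area
```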
Comparison with the SOTA methods on various poses:
We mainly compare ours with the SOTA work [4] in the FreiHAND evaluation. We use the same datasets as [4], including FreiHAND, YT-3D, and a synthetic dataset, to train our AffineNet. To verify the robustness and effectiveness of our method, we also evaluate AffineNet through the online FreiHAND Competition. As shown in Table 1, our method significantly outperforms the current SOTA methods on all evaluation indicators. Figure 4 further shows that our method is more stable across the evaluation range. It is worth mentioning that, up to now, our results are also SOTA on the FreiHAND Competition Leaderboard.
Comparison with the SOTA methods on occlusion:
To quickly verify the reconstruction performance in various occlusion scenarios, we use the same model evaluated on FreiHAND to evaluate results on HO3D. Similar to the FreiHAND evaluation, we compare results on the HO3D Competition Leaderboard. As shown in Table 2 (all results are from the HO3D Competition Leaderboard), our method achieves good performance under various occlusions, even in scenarios with heavy occlusion. The pose error is 0.03 lower than that of the test dataset provider [5]. It is worth mentioning that AffineNet also achieves SOTA on the joint evaluation of the HO3D Competition Leaderboard without using the HO3D dataset for training, which shows that AffineNet is robust and stable.
Table 1: Comparison of the 3D reconstruction of human hands on FreiHAND with multiple poses in real scenes. ↓ means a lower value is better, and ↑ means a higher value is better.

| Method | Pose Error ↓ | Pose AUC ↑ | Mesh Error ↓ | Mesh AUC ↑ | F@5 mm ↑ | F@15 mm ↑ |
| Boukhayma et al. [6] / Hasson et al. [29] | 1.33 | 0.737 | 1.33 | 0.736 | 0.429 | 0.907 |
| Zimmermann et al. [27] | 1.10 | 0.783 | 1.09 | 0.783 | 0.516 | 0.934 |
| Kulon et al. [4] | 0.84 | 0.834 | 0.86 | 0.830 | 0.614 | 0.966 |
| Choi et al. [30] | 0.77 | - | 0.78 | - | 0.674 | 0.969 |
| Moon et al. [9] | 0.74 | 0.854 | 0.76 | 0.850 | 0.681 | 0.973 |
| Ours (AffineNet) | | | | | | |
Figure 4: Comparison of 3D PCK and 3D PCV on FreiHAND. We compare with [27] (yellow), [4] (blue), and [9] (green).
Table 2: Comparison of the 3D reconstruction of human hands on HO3D with varying occlusions in real scenes.
| Method | Pose Error ↓ | Pose AUC ↑ | Mesh Error ↓ | Mesh AUC ↑ | F@5 mm ↑ | F@15 mm ↑ |
| Hasson et al. [29] | | | | | | |
| Hasson et al. [1] | | | | | | |
| Hampali et al. [5] | | | | | | |
| Ours (AffineNet) | | | | | | |
Table 3: Comparison of different UV unfolding forms at different stages; S_i denotes the output $I_{UV}^{i}$ at the $i$-th pyramid level.

| Variant of our method | AUC of PCK ↑ | AUC of PCV ↑ |
| UV1+S0 | | |
| UV3+S0 | | |
Self-Comparative learning:
Besides the comparison with the SOTA methods, we also analyze the forms of UV unfolding and the coarse-to-fine strategy. All the experiments are on the FreiHAND testing dataset.
Analysis of different UV unfolding forms: As shown in Figure 2, we use the MAYA software to unfold the MANO model in three ways according to the positions of the black lines, marked from left to right as UV1, UV2, and UV3. We conduct an experimental analysis of the different UV unfolding forms, and the results are shown in Table 3. Different UV unfolding forms have a certain impact on the output of the system: AffineNet using UV2 had the worst performance, and UV3 was better than UV1. UV2 is unfolded according to the area proportion of each region of the hand, while UV1 and UV3 are unfolded according to the proportion of the number of points in each region of the hand but with different finger positions. We believe that the features of the five fingertips have a certain inherent connection; moreover, given the properties of the convolution operation, the network may be easier to train when the five fingers are grouped together.
Analysis of the coarse-to-fine strategy: With the same unfolding form, we evaluate the UV maps at each stage. The results are shown in Table 3. As the resolution of the UV map increases, the pose AUC and the mesh AUC increase, indicating that the reconstruction results improve. Note that S0 is the full resolution and S2 is 1/4 of the full resolution.
High-quality hand modeling via SR-Affine
Cascading AffineNet and SRNet, we propose the novel SR-Affine approach for monocular high-quality 3D hand model reconstruction. As described in the SRNet section, we use the low-quality UV map as the input of SRNet and adopt an L1 loss between the output of SRNet and the high-quality UV map.
Figure 5: Qualitative mesh reconstruction results shown on HO3D (left half) and FreiHAND (right half). Base is the reconstruction result from AffineNet, Quality is the high-quality result from SRNet, and View B is another view of Quality.
We train SRNet on the SHS dataset with the Adam optimizer on 4 GPUs. The batch size is set to 512, and the initial learning rate is set to 1e-3. The input and output UV map sizes of SRNet are both 256x256. During training, we perform standard augmentations by randomly rotating the UV maps. To give the network better generalizability, we also randomly degrade high-quality UV maps into low-quality UV maps by Gaussian smoothing and use them as network inputs. During testing, we use the UV map output by AffineNet as the input of SRNet to implement high-quality reconstruction.
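The degradation just described, turning a high-quality UV map into a synthetic low-quality input, can be realized as a simple Gaussian blur. The kernel size and sigma below are illustrative choices, not values we tuned in the paper.

```python
import torchvision.transforms.functional as TF

def degrade_uv(uv_hq, kernel_size=11, sigma=3.0):
    """Synthesize a low-quality UV map from a high-quality one by Gaussian
    smoothing, used to augment SRNet training pairs.

    uv_hq: (B, 3, 256, 256) high-quality UV position map.
    """
    return TF.gaussian_blur(uv_hq, kernel_size=kernel_size, sigma=sigma)
```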
Figure 5 shows the high-quality 3D hand reconstruction results of the SR-Affine approach under various poses and occlusion conditions. Comparing Base and Quality in Figure 5, the 3D model generated from the high-quality UV map is clearly better than that from the low-quality UV map, especially in details such as creases and skin bulges, as well as in holistic hand poses. In fact, SRNet is only a first attempt at high-fidelity 3D reconstruction: similar to [25], there is no label for quantitative evaluation, and only subjective evaluation, as for GANs, can be performed.
Conclusion
In this paper, we propose a novel SR-Affine approach for monocular high-quality 3D hand model reconstruction. The approach comprises a coarse-to-fine module (AffineNet) and a point-cloud up-sampling module (SRNet). The coarse-to-fine module with affine connections resolves the coordinate ambiguity between the RGB image and the UV map, allowing us to rank first in the FreiHAND and HO3D hand benchmark competitions. In addition, we are the first to propose point-cloud up-sampling for 3D model refinement via image super-resolution. Extensive benchmark experiments verify the robustness and effectiveness of SR-Affine. In the future, we will continue to improve this work: considering that the ground-truth meshes collected by the 3D scanner inevitably have incomplete parts, we will cast the completion of incomplete 3D models as image inpainting on the UV map.
References
[1] Y. Hasson, G. Varol, D. Tzionas, I. Kalevatykh, M. J. Black, I. Laptev, et al., "Learning joint reconstruction of hands and manipulated objects," in Proc. IEEE Conf. Comput. Vision Pattern Recognition (CVPR), Long Beach, CA: IEEE, 2019, pp. 11807–11816.
[2] C. Zimmermann and T. Brox, "Learning to estimate 3D hand pose from single RGB images," in Proc. IEEE Int. Conf. Comput. Vision (ICCV), Venice, Italy: IEEE, 2017, pp. 4903–4911.
[3] F. Mueller, F. Bernard, O. Sotnychenko, D. Mehta, S. Sridhar, D. Casas, et al., "GANerated hands for real-time 3D hand tracking from monocular RGB," in 2018 IEEE/CVF Conf. Comput. Vision Pattern Recognition (CVPR), Salt Lake City, UT: IEEE, 2018, pp. 49–59.
[4] D. Kulon, R. A. Güler, I. Kokkinos, M. Bronstein, and S. Zafeiriou, "Weakly-supervised mesh-convolutional hand reconstruction in the wild," in Proc. IEEE/CVF Conf. Comput. Vision Pattern Recognition (CVPR), Seattle, WA: IEEE, 2020, pp. 4990–5000.
[5] S. Hampali, M. Rad, M. Oberweger, and V. Lepetit, "HOnnotate: A method for 3D annotation of hand and object poses," in 2020 IEEE/CVF Conf. Comput. Vision Pattern Recognition (CVPR), Seattle, WA: IEEE, 2020, pp. 3193–3203.
[6] A. Boukhayma, R. De Bem, and P. H. S. Torr, "3D hand shape and pose from images in the wild," in 2019 IEEE/CVF Conf. Comput. Vision Pattern Recognition (CVPR), Long Beach, CA: IEEE, 2019, pp. 10835–10844.
[7] X. Zhang, Q. Li, H. Mo, W. Zhang, and W. Zheng, "End-to-end hand mesh recovery from a monocular RGB image," in 2019 IEEE/CVF Int. Conf. Comput. Vision (ICCV), Seoul, South Korea: IEEE, 2019, pp. 2354–2364.
[8] N. Kolotouros, G. Pavlakos, M. Black, and K. Daniilidis, "Learning to reconstruct 3D human pose and shape via model-fitting in the loop," in 2019 IEEE/CVF Int. Conf. Comput. Vision (ICCV), Seoul, South Korea: IEEE, 2019, pp. 2252–2261.
[9] G. Moon and K. M. Lee, "I2L-MeshNet: Image-to-lixel prediction network for accurate 3D human pose and mesh estimation from a single RGB image," arXiv preprint arXiv:2008.03713, Aug. 2020.
[10] Z. Cao, G. Hidalgo, T. Simon, S. E. Wei, and Y. Sheikh, "OpenPose: Realtime multi-person 2D pose estimation using part affinity fields," arXiv preprint arXiv:1812.08008, Dec. 2018.
[11] T. Alldieck, G. Pons-Moll, C. Theobalt, and M. Magnor, "Tex2Shape: Detailed full human body geometry from a single image," in 2019 IEEE/CVF Int. Conf. Comput. Vision (ICCV), Seoul, South Korea: IEEE, 2019, pp. 2293–2303.
[12] P. Yao, Z. Fang, F. Wu, Y. Feng, and J. Li, "DenseBody: Directly regressing dense 3D human pose and shape from a single color image," arXiv preprint arXiv:1903.10153, Mar. 2019.
[13] T. Simon, H. Joo, I. Matthews, and Y. Sheikh, "Hand keypoint detection in single images using multiview bootstrapping," in 2017 IEEE Conf. Comput. Vision Pattern Recognition (CVPR), Honolulu, HI: IEEE, 2017, pp. 4645–4653.
[14] E. Remelli, S. Han, S. Honari, P. Fua, and R. Wang, "Lightweight multi-view 3D pose estimation through camera-disentangled representation," in 2020 IEEE/CVF Conf. Comput. Vision Pattern Recognition (CVPR), Seattle, WA: IEEE, 2020, pp. 6039–6048.
[15] T. Chen, C. Fang, X. Shen, Y. Zhu, Z. Chen, and J. Luo, "Anatomy-aware 3D human pose estimation in videos," arXiv preprint arXiv:2002.10322, Feb. 2020.
[16] H. Qiu, C. Wang, J. Wang, N. Wang, and W. Zeng, "Cross view fusion for 3D human pose estimation," in 2019 IEEE/CVF Int. Conf. Comput. Vision (ICCV), Seoul, South Korea: IEEE, 2019, pp. 4341–4350.
[17] X. Zhou, Q. Huang, X. Sun, X. Xue, and Y. Wei, "Towards 3D human pose estimation in the wild: A weakly-supervised approach," in 2017 IEEE Int. Conf. Comput. Vision (ICCV), Venice, Italy: IEEE, 2017, pp. 398–407.
[18] D. Pavllo, C. Feichtenhofer, D. Grangier, and M. Auli, "3D human pose estimation in video with temporal convolutions and semi-supervised training," in 2019 IEEE/CVF Conf. Comput. Vision Pattern Recognition (CVPR), Long Beach, CA: IEEE, 2019, pp. 7745–7754.
[19] K. Iskakov, E. Burkov, V. Lempitsky, and Y. Malkov, "Learnable triangulation of human pose," in 2019 IEEE/CVF Int. Conf. Comput. Vision (ICCV), Seoul, South Korea: IEEE, 2019, pp. 7717–7726.
[20] M. Defferrard, X. Bresson, and P. Vandergheynst, "Convolutional neural networks on graphs with fast localized spectral filtering," in Adv. Neural Inf. Process. Syst., Barcelona, Spain: NIPS, 2016, pp. 3844–3852.
[21] N. Verma, E. Boyer, and J. Verbeek, "FeaStNet: Feature-steered graph convolutions for 3D shape analysis," in 2018 IEEE/CVF Conf. Comput. Vision Pattern Recognition (CVPR), Salt Lake City, UT: IEEE, 2018, pp. 2598–2606.
[22] N. Wang, Y. Zhang, Z. Li, Y. Fu, W. Liu, and Y. G. Jiang, "Pixel2Mesh: Generating 3D mesh models from single RGB images," in Proc. Eur. Conf. Comput. Vision (ECCV), Munich, Germany, 2018, pp. 52–67.
[23] L. Ge, Z. Ren, Y. Li, Z. Xue, Y. Wang, J. Cai, et al., "3D hand shape and pose estimation from a single RGB image," in 2019 IEEE/CVF Conf. Comput. Vision Pattern Recognition (CVPR), Long Beach, CA: IEEE, 2019, pp. 10825–10834.
[24] N. Kolotouros, G. Pavlakos, and K. Daniilidis, "Convolutional mesh regression for single-image human shape reconstruction," in 2019 IEEE/CVF Conf. Comput. Vision Pattern Recognition (CVPR), Long Beach, CA: IEEE, 2019, pp. 4496–4505.
[25] G. Moon, T. Shiratori, and K. M. Lee, "DeepHandMesh: A weakly-supervised deep encoder-decoder framework for high-fidelity hand mesh modeling," arXiv preprint arXiv:2008.08213, Aug. 2020.
[26] M. Jaderberg, K. Simonyan, A. Zisserman, and K. Kavukcuoglu, "Spatial transformer networks," in Adv. Neural Inf. Process. Syst., 2015, pp. 2017–2025.
[27] C. Zimmermann, D. Ceylan, J. Yang, B. Russell, M. J. Argus, and T. Brox, "FreiHAND: A dataset for markerless capture of hand pose and shape from single RGB images," in 2019 IEEE/CVF Int. Conf. Comput. Vision (ICCV), Seoul, South Korea: IEEE, 2019, pp. 813–822.
[28] Y. Xiang, T. Schmidt, V. Narayanan, and D. Fox, "PoseCNN: A convolutional neural network for 6D object pose estimation in cluttered scenes," arXiv preprint arXiv:1711.00199, Nov. 2017.
[29] Y. Hasson, B. Tekin, F. Bogo, I. Laptev, M. Pollefeys, and C. Schmid, "Leveraging photometric consistency over time for sparsely supervised hand-object reconstruction," in Proc. IEEE/CVF Conf. Comput. Vision Pattern Recognition (CVPR), Seattle, WA: IEEE, 2020, pp. 571–580.
[30] H. Choi, G. Moon, and K. Lee, "Pose2Mesh: Graph convolutional network for 3D human pose and mesh recovery from a 2D human pose," arXiv preprint arXiv:2008.09047, Aug. 2020.
[31] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, et al., "PyTorch: An imperative style, high-performance deep learning library," in Adv. Neural Inf. Process. Syst., 2019, pp. 8024–8035.
[33] Autodesk, Maya, computer software, Autodesk, Inc.
[35] M. Loper, N. Mahmood, J. Romero, G. Pons-Moll, and M. J. Black, "SMPL: A skinned multi-person linear model," ACM Trans. Graphics, vol. 34, no. 6, 2015.
[36] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in 2016 IEEE Conf. Comput. Vision Pattern Recognition (CVPR), Las Vegas, NV: IEEE, 2016, pp. 770–778.
[37] J. Romero, D. Tzionas, and M. J. Black, "Embodied hands: Modeling and capturing hands and bodies together," ACM Trans. Graphics, vol. 36, no. 6, 2017.