Cross View Fusion for 3D Human Pose Estimation
Haibo Qiu, Chunyu Wang, Jingdong Wang, Naiyan Wang, Wenjun Zeng
Haibo Qiu∗, University of Science and Technology of China, [email protected]
Chunyu Wang, Microsoft Research Asia, [email protected]
Jingdong Wang, Microsoft Research Asia, [email protected]
Naiyan Wang, TuSimple, [email protected]
Wenjun Zeng, Microsoft Research Asia, [email protected]
Abstract
We present an approach to recover absolute 3D human poses from multi-view images by incorporating multi-view geometric priors in our model. It consists of two separate steps: (1) estimating the 2D poses in multi-view images and (2) recovering the 3D poses from the multi-view 2D poses. First, we introduce a cross-view fusion scheme into CNN to jointly estimate 2D poses for multiple views. Consequently, the 2D pose estimation for each view already benefits from other views. Second, we present a recursive Pictorial Structure Model to recover the 3D pose from the multi-view 2D poses. It gradually improves the accuracy of the 3D pose with affordable computational cost. We test our method on two public datasets, H36M and Total Capture. The Mean Per Joint Position Errors on the two datasets are 26mm and 29mm, which outperforms the state-of-the-art remarkably (26mm vs 52mm, 29mm vs 35mm). Our code is released at https://github.com/microsoft/multiview-human-pose-estimation-pytorch.
1. Introduction
The task of 3D pose estimation has made significant progress due to the introduction of deep neural networks. Most efforts [16, 13, 33, 17, 23, 19, 29, 28, 6] have been devoted to estimating relative 3D poses from monocular images. The estimated poses are centered around the pelvis joint, and thus their absolute locations in the environment (world coordinate system) are unknown.

In this paper, we tackle the problem of estimating absolute 3D poses in the world coordinate system from multiple cameras [1, 15, 4, 18, 3, 20]. Most works follow the pipeline of first estimating 2D poses and then recovering the 3D pose from them. However, the latter step usually depends on the performance of the first step, which unfortunately often has large errors in practice, especially when occlusion or motion blur occurs in the images. This poses a big challenge for the final 3D estimation.

On the other hand, using the Pictorial Structure Model (PSM) [14, 18, 3] for 3D pose estimation can alleviate the influence of inaccurate 2D joints by considering their spatial dependence. It discretizes the space around the root joint by an N × N × N grid and assigns each joint to one of the N³ bins (hypotheses). It jointly minimizes the projection error between the estimated 3D pose and the 2D pose, along with the discrepancy between the spatial configuration of joints and its prior structure. However, the space discretization causes large quantization errors. For example, when the space surrounding the human is of size 2000mm and N is 32, the quantization error is as large as 30mm. We could reduce the error by increasing N, but the inference cost also increases at O(N⁶), which is usually intractable.

Our work aims to address the above challenges. First, we obtain more accurate 2D poses by jointly estimating them from multiple views using a CNN based approach.

∗ This work was done when Haibo Qiu was an intern at Microsoft Research Asia.
It elegantly addresses the challenge of finding the corresponding locations between different views for 2D pose heatmap fusion. We implement this idea by a fusion neural network as shown in Figure 1. The fusion network can be integrated with any CNN based 2D pose estimator in an end-to-end manner without intermediate supervision.

Second, we present the Recursive Pictorial Structure Model (RPSM) to recover the 3D pose from the estimated multi-view 2D pose heatmaps. Different from PSM, which directly discretizes the space into a large number of bins in order to control the quantization error, RPSM recursively discretizes the space around each joint location (estimated in the previous iteration) into a finer-grained grid using a small number of bins. As a result, the estimated 3D pose is refined step by step. Since N in each step is usually small, the inference speed is very fast for a single iteration. In our experiments,
Figure 1. Cross-view fusion for 2D pose estimation. The images are first fed into a CNN to get initial heatmaps. Then the heatmap of each view is fused with the heatmaps from other views through a fusion layer. The whole network is learned end-to-end.
RPSM decreases the error by at least 50% compared to PSM with little increase of inference time.

For 2D pose estimation on the H36M dataset [11], the average detection rate over all joints improves from 89% to 96%. The improvement is most significant for the challenging "wrist" joint. For 3D pose estimation, changing PSM to RPSM dramatically reduces the average error from 77.28mm to 26.21mm. Even compared with the state-of-the-art method with an average error of 52.8mm, our approach cuts the error in half. We further evaluate our approach on the Total Capture dataset [27] to validate its generalization ability. It still outperforms the state-of-the-art [26].
2. Related Work
We first review the related work on multi-view 3D pose estimation and discuss how it differs from our work. Then we discuss some techniques for feature fusion.
Multi-view 3D Pose Estimation
Many approaches [15, 10, 4, 18, 3, 19, 20] have been proposed for multi-view pose estimation. They first define a body model represented as simple primitives, and then optimize the model parameters to align the projections of the body model with the image features. These approaches differ in terms of the image features used and the optimization algorithms.

We focus on the Pictorial Structure Model (PSM), which is widely used in object detection [8, 9] to model the spatial dependence between object parts. This technique is also used for 2D [32, 5, 1] and 3D [4, 18] pose estimation, where the parts are the body joints or limbs. In [1], Amin et al. first estimate the 2D poses in a multi-view setup with PSM and then obtain the 3D poses by direct triangulation. Later, Burenius et al. [4] and Pavlakos et al. [18] extend PSM to multi-view 3D human pose estimation. For example, in [18], they first estimate 2D poses independently for each view and then recover the 3D pose using PSM. Our work differs from [18] in that we extend PSM to a recursive
Figure 2. Epipolar geometry: an image point Y_P^u back-projects to a ray in 3D defined by the camera center C_u and Y_P^u. This line is imaged as I in camera C_v. The 3D point P which projects to Y_P^u must lie on this ray, so the image of P in camera C_v must lie on I.

version, i.e. RPSM, which efficiently refines the 3D pose estimations step by step. In addition, they [18] do not perform cross-view feature fusion as we do.

Multi-image Feature Fusion
Fusing features from different sources is a common practice in the computer vision literature. For example, in [34], Zhu et al. propose to warp the features of the neighboring frames (in a video sequence) to the current frame according to optical flow in order to robustly detect objects. Ding et al. [7] propose to aggregate multi-scale features, which achieves better segmentation accuracy for both large and small objects. Amin et al. [1] propose to estimate 2D poses by exploring the geometric relation between multi-view images. It differs from our work in that it does not fuse features from other views to obtain better 2D heatmaps. Instead, they use the multi-view 2D geometric relation to select the joint locations from the "imperfect" heatmaps. In [12], multi-view consistency is used as a source of supervision to train the pose estimation network. To the best of our knowledge, there is no previous work which fuses multi-view features so as to obtain better 2D pose heatmaps, because finding the corresponding features across different views is a challenging task; solving it is one of the key contributions of this work.
3. Cross View Fusion for 2D Pose Estimation
Our 2D pose estimator takes multi-view images as input, generates initial pose heatmaps respectively for each, and then fuses the heatmaps across different views such that the heatmap of each view benefits from the others. The process is accomplished in a single CNN and can be trained end-to-end. Figure 1 shows the pipeline for two-view fusion. Extending it to multiple views is trivial: the heatmap of each view is fused with the heatmaps of all other views. The core of our fusion approach is to find the corresponding features between a pair of views.

Suppose there is a point P in 3D space. See Figure 2. Its projections in views u and v are Y_P^u ∈ Z_u and Y_P^v ∈ Z_v, respectively, where Z_u and Z_v denote all pixel locations in the two views.

Figure 3. Two-view feature fusion for one channel. The top grid denotes the feature map of view A. Each location in view A is connected to all pixels in view B by a weight matrix. The weights are mostly positive for locations on the epipolar line (numbers in the yellow cells). Different locations in view A have different weights because they correspond to different epipolar lines.

The heatmaps of views u and v are F_u = {x_1^u, ..., x_{|Z_u|}^u} and F_v = {x_1^v, ..., x_{|Z_v|}^v}. The core idea of fusing a feature in view u, say x_i^u, with the features from F_v is to establish the correspondence between the two views:

x_i^u ← x_i^u + Σ_{j=1}^{|Z_v|} ω_{j,i} · x_j^v,  ∀ i ∈ Z_u,   (1)

where ω_{j,i} is a scalar to be determined. Ideally, for a specific i, only one ω_{j,i} should be positive while the rest are zero. Specifically, ω_{j,i} is positive when pixel i in view u and pixel j in view v correspond to the same 3D point.

Suppose we know only Y_P^u; how can we find the corresponding point Y_P^v in the image of a different view? We know Y_P^v is guaranteed to lie on the epipolar line I.
But since we do not know the depth of P, which means it may move on the line defined by C_u and Y_P^u, we cannot determine the exact location of Y_P^v on I. This ambiguity poses a challenge for cross view fusion.

Our solution is to fuse x_i^u with all features on the line I. This may sound brutal at first glance, but is in fact elegant. Since fusion happens in the heatmap layer, ideally x_j^v should have a large response at Y_P^v (the cyan point) and zeros at other locations on the epipolar line I. This means the non-corresponding locations on the line will contribute little or nothing to the fusion. So fusing all pixels on the epipolar line is a simple yet effective solution.

The feature fusion rule (Eq. (1)) can be interpreted as a fully connected layer imposed on each channel of the pose heatmaps, where the ω are the learnable parameters. Figure 3 illustrates this idea. Different channels of the feature maps, which correspond to different joints, share the same weights, because the cross view relations do not depend on the joint types but only on the pixel locations in the camera views. Treating feature fusion as a neural network layer enables end-to-end learning of the weights.

We investigate two methods to train the network. In the first approach, we clip the positive weights to zero during training if the corresponding locations are off the epipolar line. Negative weights are allowed, to represent suppression relations. In the second approach, we allow the network to freely learn the weights from the training data. The final 2D pose estimation results are similar for the two approaches, so we use the second approach for training because it is simpler.

The learned fusion weights, which implicitly encode the information of epipolar geometry, are dependent on the camera configuration.
As a result, a model trained on a particular camera configuration cannot be directly applied to a different configuration.

We propose an approach to automatically adapt our model to a new environment without any annotations. We adopt a semi-supervised training approach following the previous work [21]. First, we train a single view 2D pose estimator [31] on existing datasets such as MPII, which have ground truth pose annotations. Then we apply the trained model to the images captured by multiple cameras in the new environment and harvest a set of poses as pseudo labels. Since the estimations may be inaccurate for some images, we use multi-view consistency to filter the incorrect labels: we keep the labels which are consistent across different views, following [21]. In training the cross view fusion network, we do not enforce supervision on the filtered joints. We evaluate this approach in the experiment section.
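As a concrete illustration, the per-channel fusion of Eq. (1) can be sketched as a matrix product over flattened heatmaps. This is a minimal NumPy sketch, not the paper's implementation; the weight matrix `W_vu` stands in for the learned fusion parameters, and the function name is our own.

```python
import numpy as np

def fuse_heatmaps(h_u, h_v, W_vu):
    """Cross-view heatmap fusion in the spirit of Eq. (1).

    h_u, h_v : heatmaps of shape (J, H, W) for J joints in views u and v.
    W_vu     : (H*W, H*W) weight matrix mapping pixels of view v to pixels
               of view u. It is shared across all joint channels because
               the epipolar relation depends only on pixel locations,
               not on joint types.
    """
    J, H, W = h_v.shape
    flat_v = h_v.reshape(J, H * W)       # each channel as a vector over Z_v
    warped = flat_v @ W_vu               # sum_j w_{j,i} * x_j^v for every i
    return h_u + warped.reshape(J, H, W)
```

With the identity weight matrix the fused heatmap is simply the sum of the two views; in the paper the weights are learned end-to-end and concentrate on the epipolar lines.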
4. RPSM for Multi-view 3D Pose Estimation
We represent a human body as a graphical model with M random variables J = {J_1, J_2, ..., J_M}, in which each variable corresponds to a body joint. Each variable J_i defines a state vector J_i = [x_i, y_i, z_i] as the 3D position of the body joint in the world coordinate system, and takes its value from a discrete state space. See Figure 4. An edge between two variables denotes their conditional dependence and can be interpreted as a physical constraint.
Given a configuration of the 3D pose J and the multi-view 2D pose heatmaps F, the posterior becomes [3]:

p(J | F) = (1/Z(F)) ∏_{i=1}^{M} φ_i^conf(J_i, F) ∏_{(m,n)∈E} ψ^limb(J_m, J_n),   (2)

where Z(F) is the partition function and E are the graph edges, as shown in Figure 4. The unary potential functions φ_i^conf(J_i, F) are computed based on the previously estimated multi-view 2D pose heatmaps F. The pairwise potential functions ψ^limb(J_m, J_n) encode the limb length constraints between the joints.

Figure 4. Graphical model of the human body used in our experiments.

Discrete state space
We first triangulate the 3D location of the root joint using its 2D locations detected in all views. Then the state space of the 3D pose is constrained to be within a 3D bounding volume centered at the root joint. The edge length s of the volume is set to 2000mm. The volume is discretized by an N × N × N grid G. All body joints share the same state space G, which consists of N³ discrete locations (bins).

Unary potentials

Every body joint hypothesis, i.e. a bin in the grid G, is defined by its 3D position in the world coordinate system. We project it to the pixel coordinate system of all camera views using the camera parameters, and get the corresponding joint confidence from F. We compute the average confidence over all camera views as the unary potential for the hypothesis.

Pairwise potentials

Offline, for each pair of joints (J_m, J_n) in the edge set E, we compute the average distance l̃_{m,n} on the training set as a limb length prior. During inference, the pairwise potential is defined as:

ψ^limb(J_m, J_n) = 1 if l_{m,n} ∈ [l̃_{m,n} − ε, l̃_{m,n} + ε], and 0 otherwise,   (3)

where l_{m,n} is the distance between J_m and J_n. The pairwise term favors 3D poses having reasonable limb lengths. In our experiments, ε is set to 150mm.
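The pairwise term of Eq. (3) is simple enough to state in a few lines. A minimal sketch (the 150mm default tolerance is illustrative, and the function name is ours):

```python
import numpy as np

def limb_potential(j_m, j_n, prior_len, eps=150.0):
    """Pairwise limb-length potential of Eq. (3).

    j_m, j_n  : candidate 3D joint positions (mm).
    prior_len : average limb length l~_{m,n} from the training set (mm).
    eps       : tolerance epsilon (an illustrative default).
    Returns 1.0 if the candidate limb length lies within eps of the prior,
    else 0.0, so implausible limb lengths are pruned during inference.
    """
    l = np.linalg.norm(np.asarray(j_m, float) - np.asarray(j_n, float))
    return 1.0 if abs(l - prior_len) <= eps else 0.0
```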
Inference
The final step is to maximize the posterior (Eq. (2)) over the discrete state space. Because the graph is acyclic, it can be optimized by dynamic programming with a global optimum guarantee. The computational complexity is of the order of O(N⁶).

Figure 5. Illustration of the recursive pictorial structure model. Suppose we have estimated the coarse locations L_m and L_n for the two joints J_m and J_n, respectively, in the previous iteration. Then we divide the space around the two joints into finer-grained grids and estimate more precise locations.

The PSM model suffers from large quantization errors caused by space discretization. For example, when we set N = 32 as in previous work, the quantization error is as large as 31mm (i.e. s/(2N), where s = 2000 is the edge length of the bounding volume). Increasing N can reduce the quantization error, but the computation time quickly becomes intractable. For example, if N = 64, the inference speed will be
64 (i.e. 2⁶) times slower.

Instead of using a large N in one iteration, we propose to recursively refine the joint locations through a multi-stage process, using a small N in each stage. In the first stage (t = 0), we discretize the 3D bounding volume around the triangulated root joint using a coarse grid (N = 16) and obtain an initial 3D pose estimate L = (L_1, ..., L_M) using the PSM approach.

For the following stages (t ≥ 1), for each joint J_i we discretize the space around its current location L_i into a 2 × 2 × 2 grid G^(i). The space discretization here differs from PSM in two respects. First, different joints have their own grids, whereas in PSM all joints share the same grid; see Figure 5 for an illustration. Second, the edge length of the bounding volume decreases with iterations: s_t = s_{t−1}/N. That is the main reason why the grid becomes finer-grained compared to the previous stage.

Instead of refining each joint independently, we simultaneously refine all joints considering their spatial relations. Recall that we know the center locations, sizes and numbers of bins of the grids, so we can calculate the location of every bin and compute the unary and pairwise potentials. It is worth noting that the pairwise potentials have to be computed on the fly because they depend on the previously estimated locations. However, because we set N to a small number (two in our experiments), this computation is fast.

4.3. Relation to Bundle Adjustment [25]

Bundle adjustment [25] is also a popular tool for refining 3D reconstructions. RPSM differs from it in two aspects. First, they reach different local optima due to their distinct ways of exploring the space: bundle adjustment explores in an incremental way, while RPSM explores in a divide and conquer way. Second, computing gradients by finite differences in bundle adjustment is not stable because most entries of the heatmaps are zeros.
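To make the coarse-to-fine search concrete, here is a deliberately simplified single-joint sketch of the recursive refinement. The real RPSM refines all joints jointly by dynamic programming with unary and pairwise potentials; here `score_fn` is an arbitrary stand-in scoring function (e.g. projected heatmap confidence), and all names are our own.

```python
import numpy as np
from itertools import product

def refine_joint(score_fn, center, size, n_bins=2, n_iters=10):
    """Recursive grid refinement around one joint, a simplified sketch
    of the RPSM coarse-to-fine search.

    center : initial 3D estimate (mm); size : edge length of the cube.
    Each iteration scores an n_bins^3 grid inside the current cube,
    moves to the best bin center, and shrinks the cube to one bin,
    so precision improves geometrically with the iteration count.
    """
    center = np.asarray(center, float)
    for _ in range(n_iters):
        step = size / n_bins
        # centers of the n_bins^3 sub-cubes of the current cube
        offsets = [(i + 0.5) * step - size / 2 for i in range(n_bins)]
        candidates = [center + np.array(o) for o in product(offsets, repeat=3)]
        center = max(candidates, key=score_fn)
        size = step                      # the grid becomes finer each stage
    return center
```

With n_bins = 2 the search volume halves per iteration, matching the s_t = s_{t−1}/N schedule above.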
5. Datasets and Metrics
The H36M Dataset [11]
We use a cross-subject evaluation scheme where subjects 1, 5, 6, 7, 8 are used for training and 9, 11 for testing. We train a single fusion model for all subjects because their camera parameters are similar. In some experiments (which will be clearly stated), we also use the MPII dataset [2] to augment the training data. Since this dataset only has monocular images, we do not train the fusion layer on these images.

The Total Capture Dataset [27]

We also evaluate our approach on the Total Capture dataset to validate its general applicability to other datasets. Following the previous work [27], the training set consists of "ROM1,2,3", "Walking1,3", "Freestyle1,2", "Acting1,2", "Running1" on subjects 1, 2 and 3. The testing set consists of "Freestyle3 (
FS3)", "Acting3 (A3)" and "Walking2 (W2)" on subjects 1, 2, 3, 4 and 5. We use the data of four cameras (1, 3, 5, 7) in the experiments. We do not use the IMU sensors, and we do not use the MPII dataset for training in this experiment. The hyper-parameters for training the network are kept the same as those on the H36M dataset.

Metrics
The 2D pose estimation accuracy is measured by the Joint Detection Rate (JDR). If the distance between the estimated and the ground-truth locations is smaller than a threshold, we regard the joint as successfully detected. The threshold is set to half of the head size, as in [2]. JDR is the percentage of successfully detected joints.

The 3D pose estimation accuracy is measured by the Mean Per Joint Position Error (MPJPE) between the ground-truth 3D pose y = [p_1, ..., p_M] and the estimated 3D pose ȳ = [p̄_1, ..., p̄_M]:

MPJPE = (1/M) Σ_{i=1}^{M} ‖p_i − p̄_i‖₂.

We do not align the estimated 3D poses to the ground truth. This is referred to as protocol 1 in [16, 24].
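Both metrics follow directly from their definitions; a short sketch (array shapes are our choice):

```python
import numpy as np

def mpjpe(pred, gt):
    """Mean Per Joint Position Error: average Euclidean distance (mm)
    between predicted and ground-truth 3D joints, without rigid
    alignment (protocol 1). pred, gt: arrays of shape (M, 3)."""
    return np.linalg.norm(pred - gt, axis=1).mean()

def jdr(pred_2d, gt_2d, threshold):
    """Joint Detection Rate: fraction of 2D joints whose distance to
    the ground truth is below a threshold (half the head size in the
    paper). pred_2d, gt_2d: arrays of shape (M, 2)."""
    dists = np.linalg.norm(pred_2d - gt_2d, axis=1)
    return (dists < threshold).mean()
```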
6. Experiments on 2D Pose Estimation
We adopt the network proposed in [31] as our base network and use ResNet-152, pretrained on the ImageNet classification dataset, as its backbone. The input
Table 1. 2D pose estimation accuracy (JDR, %) on the H36M dataset, shown for six important joints due to space limitations. "+MPII" means we train on "H36M+MPII".

Method | Training Dataset | Shlder | Elb | Wri | Hip | Knee | Ankle
Single | H36M | 88.50 | 88.94 | 85.72 | 90.37 | 94.04 | 90.11
Sum | H36M | 91.36 | 91.23 | 89.63 | 96.19 | 94.14 | 90.38
Max | H36M | 92.67 | 92.45 | 91.57 | 97.69 | 95.01 | 91.88
Ours | H36M | — | — | — | — | — | —
Single | +MPII | 97.38 | 93.54 | 89.33 | 99.01 | 95.10 | 91.96
Ours | +MPII | — | — | — | — | — | —
Table 2. 3D pose estimation error MPJPE (mm) on H36M when different datasets are used for training. "+MPII" means we use the combined dataset "H36M+MPII" for training. 3D poses are obtained by direct triangulation.

Method | Training Dataset | Shlder | Elb | Wri | Hip | Knee | Ankle
Single | H36M | 59.70 | 89.56 | 313.25 | 69.35 | 76.34 | 120.97
Ours | H36M | — | — | — | — | — | —
Single | +MPII | 30.82 | 38.32 | 64.18 | 24.70 | 38.38 | 62.92
Ours | +MPII | — | — | — | — | — | —
Figure 6. Sample heatmaps of our approach. "Detected heatmap" denotes a heatmap extracted from the image of the current view. The "warped heatmap" is obtained by summing the heatmaps warped from the other three views. We fuse the "warped heatmap" and the "detected heatmap" to obtain the "fused heatmap". For challenging images the "detected heatmaps" may be incorrect, but the "warped heatmaps" from other (easier) views are mostly correct. Fusing the multi-view heatmaps improves the heatmap quality.

image size is 320 × 320 and the resolution of the heatmap is 80 × 80. We use heatmaps as the regression targets and enforce an ℓ2 loss on all views before and after feature fusion. We train the network end-to-end; other hyper-parameters such as the learning rate and decay strategy are kept the same as in [31]. Using a more recent network structure [22] generates better 2D poses.

6.2. Quantitative Results
Table 1 shows the results on the most important joints when we train either only on the H36M dataset or on a combination of the H36M and MPII datasets. It compares our approach with the baseline method [31], termed Single, which does not perform cross view feature fusion. We also compare with two baselines which compute sum or max values over the epipolar line using the camera parameters. The hyper-parameters for training the two methods are kept the same for fair comparison.

Our approach outperforms the Single baseline on all body joints. The improvement is most significant for the wrist joint, whether the model is trained only on "H36M" or on "H36M + MPII". We believe this is because the wrist is the most frequently occluded joint, and cross view fusion uses the features of other (visible) views to help detect it. See the third column of Figure 6 for an example: the right wrist joint is occluded in the current view, so the detected heatmap has poor quality, but fusing the features with those of other views generates a better heatmap. In addition, our approach outperforms the sum and max baselines. This is because the heatmaps are often noisy, especially when occlusion occurs. Our method trains a fusion network to handle noisy heatmaps, so it is more robust than taking sum/max values along epipolar lines.

It is also interesting to see that when we only use the H36M dataset for training, the Single baseline achieves very poor performance. We believe this is because the limited appearance variation in the training set hurts the generalization power of the learned model. Our fusion approach suffers less from the lack of training data, probably because it requires the features extracted from different views to be consistent under a geometric transformation, which is a strong prior that reduces the risk of over-fitting to training data with limited appearance variation.

The improved 2D pose estimations in turn significantly reduce the 3D error. We estimate 3D poses by direct triangulation in this experiment. Table 2 shows the 3D estimation errors on the six important joints. The error for the wrist joint (which gets the largest improvement in 2D estimation) decreases dramatically from the 313.25mm of the Single baseline, and the improvement on the ankle joint is also large. The mean per joint position error over all joints (see (c) and (g) in Table 3) decreases from 36.28mm to 27.90mm when we do not align the estimated 3D pose to the ground truth.
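The direct triangulation used as a baseline here can be sketched with the standard linear (DLT) algorithm; the camera matrices and the toy two-view setup below are illustrative, not the paper's calibration.

```python
import numpy as np

def triangulate(points_2d, projections):
    """Linear (DLT) triangulation of one joint from multiple views.

    points_2d   : list of (x, y) pixel detections, one per view.
    projections : list of 3x4 camera projection matrices.
    Returns the 3D point minimizing the algebraic reprojection error.
    """
    rows = []
    for (x, y), P in zip(points_2d, projections):
        rows.append(x * P[2] - P[0])   # x * (p3 . X) = p1 . X
        rows.append(y * P[2] - P[1])   # y * (p3 . X) = p2 . X
    A = np.stack(rows)
    _, _, vt = np.linalg.svd(A)        # null vector of A = homogeneous X
    X = vt[-1]
    return X[:3] / X[3]                # dehomogenize
```

In the paper, this is applied per joint to the 2D locations detected in all four views; its accuracy degrades when any single 2D detection is poor, which is what motivates RPSM.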
In addition to the above numerical results, we also qualitatively investigate the circumstances in which our approach improves the 2D pose estimations over the baseline. Figure 6 shows four examples. First, in the fourth example (column), the detected heatmap shows strong responses at both the left and right elbows because they are hard to differentiate in this image. From the ground truth heatmap (second row) we can see that the left elbow is the target. The heatmap warped from other views (fifth row) correctly localizes the left joint; fusing the two heatmaps gives better localization accuracy. Second, the third column of Figure 6 shows the heatmap of the right wrist joint. Because the joint is occluded by the human body, the detected heatmap is incorrect, but the heatmaps warped from the other three views are correct because the joint is not occluded there.
7. Experiments on 3D Pose Estimation
In the first iteration of RPSM (t = 0), we divide the space of size 2,000mm around the estimated location of the root joint into 16 × 16 × 16 bins and estimate a coarse 3D pose by solving Eq. 2. We also tried using a larger number of bins, but the computation time becomes intractable.

For the following iterations (t ≥ 1), we divide the space around each estimated joint location into 2 × 2 × 2 bins; the size s_t of this space equals the size of a single bin in the previous iteration. We use a smaller number of bins here than in the first iteration because it significantly reduces the time for the on-the-fly computation of the pairwise potentials. In our experiments, repeating the above process for ten iterations takes well under a second, which is lightweight compared to the first iteration, which takes several seconds.

We design eight configurations to investigate the different factors of our approach. Table 3 shows how these factors decrease the error from 94.54mm to 26.21mm.

RPSM vs. Triangulation:
First, RPSM achieves significantly smaller 3D errors than triangulation when the 2D pose estimations are obtained by a relatively weak model. For instance, comparing methods (a) and (b) in Table 3, we can see that, given the same 2D poses, RPSM significantly decreases the error, from 94.54mm to 47.82mm. This is attributed to the joint optimization of all nodes and the recursive pose refinement.

Second, RPSM provides marginal improvement when the 2D pose estimations are already very accurate. For example, comparing methods (g) and (h) in Table 3, where the 2D poses are estimated by our model trained on the combined dataset ("H36M+MPII"), the error decreases only slightly, from 27.90mm to 26.21mm. This is because the input 2D poses are already very accurate and

Table 3. 3D pose estimation errors MPJPE (mm) of different methods on the H36M dataset. The naming convention of the methods follows the rule "A-B-C", where "A" indicates whether we use fusion in 2D pose estimation ("Single" means cross view fusion is not used), "B" denotes the training datasets ("H36M" means we only use the H36M dataset and "+MPII" means we combine H36M with MPII for training), and "C" represents the method for estimating 3D poses.
Method | Direction | Discus | Eating | Greet | Phone | Photo | Posing | Purch
(a) Single-H36M-Triangulate | 71.76 | 65.89 | 56.63 | 136.52 | 59.32 | 96.30 | 46.67 | 110.51
(b) Single-H36M-RPSM | 33.38 | 36.36 | 27.13 | 31.14 | 31.06 | 30.28 | 28.59 | 41.03
(c) Single-"+MPII"-Triangulate | 33.99 | 32.87 | 25.80 | 29.02 | 34.63 | 26.64 | 28.42 | 42.63
(d) Single-"+MPII"-RPSM | 26.89 | 28.05 | 23.13 | 25.75 | 26.07 | 23.45 | 24.41 | 34.02
(e) Fusion-H36M-Triangulate | 34.84 | 35.78 | 32.70 | 33.49 | 34.44 | 38.19 | 29.66 | 60.72
(f) Fusion-H36M-RPSM | 28.89 | 32.46 | 26.58 | 28.14 | 28.31 | 29.34 | 28.00 | 36.77
(g) Fusion-"+MPII"-Triangulate | 25.15 | 27.85 | 24.25 | 25.45 | 26.16 | 23.70 | 25.68 | 29.66
(h) Fusion-"+MPII"-RPSM | 23.98 | 26.71 | 23.19 | 24.30 | 24.77 | 22.82 | 24.12 | 28.62

Method | Sitting | SittingD | Smoke | Wait | WalkD | Walking | WalkT | Average
(a) Single-H36M-Triangulate | 150.10 | 57.01 | 73.15 | 292.78 | 49.00 | 48.67 | 62.62 | 94.54
(b) Single-H36M-RPSM | 245.52 | 33.74 | 37.10 | 35.97 | 29.92 | 35.23 | 30.55 | 47.82
(c) Single-"+MPII"-Triangulate | 88.69 | 36.38 | 35.48 | 31.98 | 27.43 | 32.42 | 27.53 | 36.28
(d) Single-"+MPII"-RPSM | 39.63 | 29.26 | 29.49 | 27.25 | 25.07 | 27.82 | 24.85 | 27.99
(e) Fusion-H36M-Triangulate | 53.10 | 35.18 | 40.97 | 41.57 | 31.86 | 31.38 | 34.58 | 38.29
(f) Fusion-H36M-RPSM | 41.98 | 30.54 | 35.59 | 30.03 | 28.33 | 30.01 | 30.46 | 31.17
(g) Fusion-"+MPII"-Triangulate | 40.47 | 28.60 | 32.77 | 26.83 | 26.00 | 28.56 | 25.01 | 27.90
(h) Fusion-"+MPII"-RPSM | 32.12 | 26.87 | 30.98 | 25.56 | 25.02 | 28.07 | 24.37 | 26.21
Table 4. 3D pose estimation errors when different numbers of iterations t are used in RPSM. When t = 0, RPSM is equivalent to PSM. "+MPII" means we use the combined dataset "H36M+MPII" to train the 2D pose estimation model. The MPJPE (mm) are computed without rigid alignment between the estimated pose and the ground truth.

Methods | t = 0 | t = 1 | t = 3 | t = 5 | t = 10
Single-H36M-RPSM | 95.23 | 77.95 | 51.78 | 47.93 | 47.82
Single-"+MPII"-RPSM | 78.67 | 58.94 | 32.39 | 28.04 | 27.99
Fusion-H36M-RPSM | 80.77 | 61.11 | 35.75 | 31.25 | 31.17
Fusion-"+MPII"-RPSM | 77.28 | 57.22 | 30.76 | 26.26 | 26.21

direct triangulation gives reasonably good 3D estimations. But if we focus on some difficult actions such as "Sitting", which gets the largest error among all actions, the improvement from our RPSM approach is still very significant (from 40.47mm to 32.12mm).

In summary, compared to triangulation, RPSM obtains comparable results when the 2D poses are accurate, and significantly better results when the 2D poses are inaccurate, which is often the case in practice.
RPSM vs. PSM:
We investigate the effect of the recursive 3D pose refinement. Table 4 shows the results. First, the poses estimated by PSM, i.e. RPSM with t = 0, have large errors resulting from the coarse space discretization. Second, RPSM consistently decreases the error as t grows and eventually converges. For instance, in the first row of Table 4, RPSM decreases the error of PSM from 95.23mm to 47.82mm, which validates the effectiveness of the recursive 3D pose refinement of RPSM.

Single vs. Fusion:
We now investigate the effect of cross-view feature fusion on 3D pose estimation accuracy. Table 3 shows the results. First, when we use the H36M+MPII datasets (termed "+MPII") for training and use triangulation to estimate 3D poses, the average 3D pose error of our fusion model (g) is smaller than that of the baseline without fusion (c). The improvement is most significant for the most challenging "Sitting" action, whose error decreases from 88.69mm to 40.47mm. The improvement should be attributed to the better 2D poses resulting from cross-view feature fusion. We observe consistent improvement for the other setups; for example, compare methods (a) and (e), or methods (b) and (f).
Comparison to the State-of-the-arts:
We also compare our approach to the state-of-the-art methods for multi-view human pose estimation in Table 5. Our approach outperforms the state-of-the-art by a large margin. When we train our approach only on the H36M dataset, the MPJPE error is 31.17mm, which is already much smaller than that of the previous state-of-the-art [24], 52.8mm. As discussed in the above sections, the improvement should be attributed to the more accurate 2D poses and the recursive refinement of the 3D poses.
Since it is difficult to demonstrate a 3D pose from all possible viewpoints, we propose to visualize it by projecting it back to the four camera views using the camera parameters and drawing the skeletons on the images.

Table 5. Comparison of the 3D pose estimation errors MPJPE (mm) with the state-of-the-art multiple view pose estimators on the H36M dataset. We do NOT use the Procrustes algorithm to align the estimations to the ground truth. The result of "Multi-view Martinez" is reported in [24]. The four state-of-the-art methods do not use the MPII dataset for training, so they are directly comparable to our result of 31.17mm.

Methods | Average MPJPE
PVH-TSP [27] | 87.3mm
Multi-View Martinez [16] | 57.0mm
Pavlakos et al. [18] | 56.9mm
Tome et al. [24] | 52.8mm
Our approach | 31.17mm
Our approach + MPII | 26.21mm

Figure 7. We project the estimated 3D poses back to the 2D image space and draw the skeletons on the images. Each row shows the skeletons of four camera views. We select three typical examples whose 3D MPJPE errors are 20mm, 40mm and 120mm, respectively.

Figure 7 shows three estimation examples. According to 3D geometry, if the 2D projections of a 3D joint are accurate in at least two views, the 3D joint estimation is accurate.
For instance, in the first example (first row of Figure 7), the 2D locations of the right hand joint in the first and fourth camera views are accurate. Based on this, we can infer with high confidence that the estimated 3D location of the right hand joint is accurate. Moreover, although the right hand joint is occluded by the human body in the second view (column), our approach still recovers its 3D location accurately due to the cross-view feature fusion. In fact, most leg joints are also occluded in the first and third views, but the corresponding 3D joints are estimated correctly. The second example has a larger error of 40 mm because the left hand joint is not accurately detected: the joint is occluded in three of the views and visible in only a single one, so cross-view feature fusion contributes little in this case. For most of the testing images, the 3D MPJPE errors are small.

There are a few cases where the error is as large as 120 mm. This usually happens when "double counting" occurs; we visualize one such example in the last row of Figure 7. Because this particular pose of the right leg was rarely seen during training, the detections of the right leg joints fall on the left leg regions consistently in all views. Consequently, the warped heatmaps corresponding to the right leg joints also fall on the left leg regions and thus cannot drag the right leg joints to the correct positions.

We conduct experiments on the Total Capture dataset to validate the general applicability of our approach. The model is trained only on the Total Capture dataset. Table 6 shows the results; "Single-RPSM" means we do NOT perform cross-view feature fusion but still use RPSM for recovering the 3D poses. First, our approach decreases the error of the previous best model [26] noticeably. Second, the improvement is larger for the hard cases such as "FS3". These results are consistent with those on the H36M dataset. Third, comparing "Single-RPSM" with "Fusion-RPSM" shows that fusing the features of different views improves the final 3D estimation accuracy significantly; in particular, the improvement is consistent across all subsets.

Table 6. 3D pose estimation errors MPJPE (mm) of different methods on the Total Capture dataset. The numbers reported for our method and the baselines are obtained without rigid alignment.

Methods | Subjects 1,2,3 (W2 / FS3 / A3) | Subjects 4,5 (W2 / FS3 / A3) | Mean
Tri-CPM [30] | 79 / 112 / 106 | 79 / 149 / 73 | 99
PVH [27] | 48 / 122 / 94 | 84 / 168 / 154 | 107
IMUPVH [27] | 30 / 91 / 49 | 36 / 112 / 10 | 70
AutoEnc [26] | 13 / 49 / 24 | 22 / 71 / 40 | 35
Single-RPSM | 28 / 42 / 30 | 45 / 74 / 46 | 41
Fusion-RPSM | 19 / 28 / 21 | 32 / 54 / 33 |
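The MPJPE metric used throughout (Tables 5 and 6, Figure 7) averages the Euclidean distance between predicted and ground-truth joints, with no rigid alignment applied; a minimal sketch:

```python
import numpy as np

def mpjpe(pred, gt):
    """Mean Per Joint Position Error, in the units of the inputs (here mm).

    pred, gt: arrays of shape (J, 3) holding predicted and ground-truth
    3D joint locations. No Procrustes/rigid alignment is applied,
    matching how the numbers in the tables are reported.
    """
    return float(np.mean(np.linalg.norm(pred - gt, axis=1)))
```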
We also conduct experiments on the H36M dataset using NO pose annotations. The single-view pose estimator [31] is trained on the MPII dataset. If we directly apply this model to the H36M test set and estimate the 3D poses by RPSM, the MPJPE error is large. If we retrain the model (without the fusion layer) on the harvested pseudo labels, the error decreases substantially. If we instead train our fusion model with those pseudo labels, the error decreases further, to a value already smaller than that of the previous supervised state-of-the-art. These results validate the feasibility of applying our model to new environments without any manual labels.

Conclusion

We propose an approach to estimate 3D human poses from multiple calibrated cameras. The first contribution is a CNN-based multi-view feature fusion approach which significantly improves the 2D pose estimation accuracy. The second contribution is a recursive pictorial structure model to estimate 3D poses from the multi-view 2D poses; it improves over the PSM by a large margin. The two contributions are independent, and each can be combined with existing methods.
References

[1] Sikandar Amin, Mykhaylo Andriluka, Marcus Rohrbach, and Bernt Schiele. Multi-view pictorial structures for 3D human pose estimation. In BMVC, 2013.
[2] Mykhaylo Andriluka, Leonid Pishchulin, Peter Gehler, and Bernt Schiele. 2D human pose estimation: New benchmark and state of the art analysis. In CVPR, pages 3686–3693, 2014.
[3] Vasileios Belagiannis, Sikandar Amin, Mykhaylo Andriluka, Bernt Schiele, Nassir Navab, and Slobodan Ilic. 3D pictorial structures for multiple human pose estimation. In CVPR, pages 1669–1676, 2014.
[4] Magnus Burenius, Josephine Sullivan, and Stefan Carlsson. 3D pictorial structures for multiple view articulated pose estimation. In CVPR, pages 3618–3625, 2013.
[5] Xianjie Chen and Alan L Yuille. Articulated pose estimation by a graphical model with image dependent pairwise relations. In NIPS, pages 1736–1744, 2014.
[6] Hai Ci, Chunyu Wang, Xiaoxuan Ma, and Yizhou Wang. Optimizing network structures for 3D human pose estimation. In ICCV, 2019.
[7] Henghui Ding, Xudong Jiang, Bing Shuai, Ai Qun Liu, and Gang Wang. Context contrasted feature and gated multi-scale aggregation for scene segmentation. In CVPR, 2018.
[8] Pedro F Felzenszwalb and Daniel P Huttenlocher. Pictorial structures for object recognition. IJCV, pages 55–79, 2005.
[9] Martin A Fischler and Robert A Elschlager. The representation and matching of pictorial structures. IEEE Transactions on Computers, pages 67–92, 1973.
[10] Juergen Gall, Bodo Rosenhahn, Thomas Brox, and Hans-Peter Seidel. Optimization and filtering for human motion capture. IJCV, 87(1-2):75, 2010.
[11] Catalin Ionescu, Dragos Papava, Vlad Olaru, and Cristian Sminchisescu. Human3.6M: Large scale datasets and predictive methods for 3D human sensing in natural environments. T-PAMI, pages 1325–1339, 2014.
[12] Yasamin Jafarian, Yuan Yao, and Hyun Soo Park. MONET: Multiview semi-supervised keypoint via epipolar divergence. arXiv preprint arXiv:1806.00104, 2018.
[13] Angjoo Kanazawa, Michael J Black, David W Jacobs, and Jitendra Malik. End-to-end recovery of human shape and pose. In CVPR, 2018.
[14] Ilya Kostrikov and Juergen Gall. Depth sweep regression forests for estimating 3D human pose from images. In BMVC, page 5, 2014.
[15] Yebin Liu, Carsten Stoll, Juergen Gall, Hans-Peter Seidel, and Christian Theobalt. Markerless motion capture of interacting characters using multi-view image segmentation. In CVPR, pages 1249–1256. IEEE, 2011.
[16] Julieta Martinez, Rayat Hossain, Javier Romero, and James J Little. A simple yet effective baseline for 3D human pose estimation. In ICCV, page 5, 2017.
[17] Georgios Pavlakos, Xiaowei Zhou, Konstantinos G Derpanis, and Kostas Daniilidis. Coarse-to-fine volumetric prediction for single-image 3D human pose. In CVPR, pages 1263–1272, 2017.
[18] Georgios Pavlakos, Xiaowei Zhou, Konstantinos G. Derpanis, and Kostas Daniilidis. Harvesting multiple views for marker-less 3D human pose annotations. In CVPR, pages 1253–1262, 2017.
[19] Helge Rhodin, Mathieu Salzmann, and Pascal Fua. Unsupervised geometry-aware representation for 3D human pose estimation. In ECCV, September 2018.
[20] Helge Rhodin, Jörg Spörri, Isinsu Katircioglu, Victor Constantin, Frédéric Meyer, Erich Müller, Mathieu Salzmann, and Pascal Fua. Learning monocular 3D human pose estimation from multi-view images. In CVPR, pages 8437–8446, 2018.
[21] Tomas Simon, Hanbyul Joo, Iain Matthews, and Yaser Sheikh. Hand keypoint detection in single images using multiview bootstrapping. In CVPR, pages 1145–1153, 2017.
[22] Ke Sun, Bin Xiao, Dong Liu, and Jingdong Wang. Deep high-resolution representation learning for human pose estimation. In CVPR, 2019.
[23] Denis Tome, Christopher Russell, and Lourdes Agapito. Lifting from the deep: Convolutional 3D pose estimation from a single image. In CVPR, pages 2500–2509, 2017.
[24] Denis Tome, Matteo Toso, Lourdes Agapito, and Chris Russell. Rethinking pose in 3D: Multi-stage refinement and recovery for markerless motion capture. In 3DV, pages 474–483, 2018.
[25] Bill Triggs, Philip F McLauchlan, Richard I Hartley, and Andrew W Fitzgibbon. Bundle adjustment: a modern synthesis. In International Workshop on Vision Algorithms, pages 298–372. Springer, 1999.
[26] Matthew Trumble, Andrew Gilbert, Adrian Hilton, and John Collomosse. Deep autoencoder for combined human pose estimation and body model upscaling. In ECCV, pages 784–800, 2018.
[27] Matthew Trumble, Andrew Gilbert, Charles Malleson, Adrian Hilton, and John Collomosse. Total Capture: 3D human pose estimation fusing video and inertial sensors. In BMVC, pages 1–13, 2017.
[28] Chunyu Wang, Yizhou Wang, Zhouchen Lin, and Alan L Yuille. Robust 3D human pose estimation from single images or video sequences. T-PAMI, 41(5):1227–1241, 2018.
[29] Chunyu Wang, Yizhou Wang, Zhouchen Lin, Alan L Yuille, and Wen Gao. Robust estimation of 3D human poses from a single image. In CVPR, pages 2361–2368, 2014.
[30] Shih-En Wei, Varun Ramakrishna, Takeo Kanade, and Yaser Sheikh. Convolutional pose machines. In CVPR, pages 4724–4732, 2016.
[31] Bin Xiao, Haiping Wu, and Yichen Wei. Simple baselines for human pose estimation and tracking. In ECCV, pages 466–481, 2018.
[32] Yi Yang and Deva Ramanan. Articulated pose estimation with flexible mixtures-of-parts. In CVPR, pages 1385–1392, 2011.
[33] Xingyi Zhou, Qixing Huang, Xiao Sun, Xiangyang Xue, and Yichen Wei. Towards 3D human pose estimation in the wild: a weakly-supervised approach. In ICCV, pages 398–407, 2017.
[34] Xizhou Zhu, Yujie Wang, Jifeng Dai, Lu Yuan, and Yichen Wei. Flow-guided feature aggregation for video object detection. In