3D Pose Detection in Videos: Focusing on Occlusion
Justin Wang [email protected]
Edward Xu [email protected]
Kangrui Xue [email protected]
Łukasz Kidziński [email protected]
Abstract
In this work, we build upon existing methods for occlusion-aware 3D pose detection in videos. We implement a two-stage architecture that consists of a stacked hourglass network to produce 2D pose predictions, which are then input into a temporal convolutional network to produce 3D pose predictions. To facilitate prediction on poses with occluded joints, we introduce an intuitive generalization of the cylinder man model used to generate occlusion labels. We find that the occlusion-aware network achieves a mean per-joint position error 5 mm lower than our linear baseline model on the Human3.6M dataset. Compared to our temporal convolutional network baseline, we achieve a comparable mean per-joint position error, 0.1 mm lower, at reduced computational cost.
1. Introduction
Human pose detection has been an active area of research in the deep learning community since 2014, with Toshev et al.'s DeepPose [7], a work focused on 2D pose estimation. The problem involves detecting joint positions and bone lengths of humans in both images and videos. In 3D pose estimation, additional challenges arise in projecting from 2D image data onto 3D keypoint coordinates and vice versa. However, in applications requiring extensive, realistic tracking of human motion, 3D pose estimation emerges as the only viable option. Practical examples include the sports and medical fields, where athletes can analyze their footwork and body form in sharp detail, and medical professionals can deeply understand a patient's gait before entering the operating room.

We focus on improving 3D pose estimation in video for occluded cases. Occluded joints are body joints that cannot be seen by the camera, blocked by other joints, other body parts, or external objects. Researchers have shown that occlusion is a significant source of error in state-of-the-art models for both 2D and 3D pose estimation of single images [4, 8]. A recent body of literature by Cheng et al. [1] has focused on 3D pose estimation in video, specifically tackling the problem of occlusion by making use of temporal information provided by videos that is unavailable in single frames.

We build a temporal convolutional network (TCN) for 3D pose estimation designed to work well specifically for occluded cases. We work with the well-known 3D pose datasets HumanEva and Human3.6M, but we primarily focus on HumanEva due to its simpler structure and lower resource and computational requirements.

Since previous works have trained and fine-tuned 2D pose estimation models using stacked hourglass architectures on the frames of our datasets, we are able to acquire predicted 2D poses, provided by [5] in the form of joint keypoints. From these 2D joint coordinates, we also generate occlusion predictions based on predefined heuristics, labeling a joint 1 if it is occluded and 0 otherwise. The inputs to our model are the predicted 2D joint coordinates and the occlusion vector. Our labels are the 3D ground truth poses from the dataset in the form of keypoints. We also include "ground truth" occlusion labels (0 or 1) generated from ground truth 3D keypoints using our own baseline heuristic. The outputs of our model are predicted 3D poses, represented as 3-dimensional coordinates in the camera's frame.
2. Related Work
A landmark improvement in 2D pose estimation came via Newell et al.'s stacked hourglass architecture [4]. Motivated by an understanding that human poses are best captured at different levels of detail (e.g., the location of faces and hands as opposed to the person's overall orientation), this architecture consists of pooling and upsampling layers whose arrangement looks like an hourglass, hence the name.

Building on the success of the stacked hourglass for 2D pose estimation, Martinez et al. [3] constructed a simple baseline that used the 2D pose estimates produced by the stacked hourglass model as inputs into a linear model to produce 3D pose estimates.

Pavllo et al. [5] focus on 3D pose estimation in video. Specifically, their approach uses a temporal convolutional network (TCN) instead of a linear model on top of the 2D pose estimates produced by the stacked hourglass model. Earlier methods incorporated recurrent neural networks (RNNs) to capture the temporal relationship between frames in a video; the temporal convolutional architecture builds on this relationship by allowing parallel processing of multiple frames, something not possible with recurrent architectures.

However, [5] did not specifically focus on occlusion and therefore had many problems predicting occluded joints. A recent body of literature by Cheng et al. [1] has focused on 3D pose estimation in video, specifically tackling the problem of occlusion. We seek to build off their implementation and results. To do so, we investigate why their model succeeds. Videos provide a sequence of frames whose temporal information can better inform a model's estimates in occluded settings by providing a context in which an occluded joint should be located. Cheng et al. [1] use an occlusion-aware convolutional neural network to mitigate the effects of occlusion on 3D pose estimation in videos. Their "cylinder man model" is a heuristic that maps 3D ground truth joint keypoints to 2D ground truth pose heatmaps while taking occlusion into account. With the 2D ground truth heatmaps, they are able to derive occlusion labels for each of the 2D keypoints. Cheng et al. take predicted 2D poses from a stacked hourglass model, filter out occluded joints, and train a 2D temporal convolutional network to smooth the predicted 2D keypoints. Lastly, they input the smoothed 2D keypoints and occlusion predictions into a 3D temporal convolutional network and generate 3D pose predictions, supervised by 3D ground truth poses and occlusion labels.

Cheng et al. primarily use 2D "ground truth" heatmaps to train a stacked hourglass model to output 2D predicted heatmaps with more occlusions. Although they also input occlusion labels to their TCN, we seek to explicitly train our TCN to recognize the ground truth occlusions by adding an occlusion term to the loss function. We also iterate upon their cylinder man model heuristic to deal with occlusion.
3. Datasets
The Human3.6M dataset [2], a 3D pose dataset, consists of 3.6 million images of actors performing a few daily-life activities. There are a total of 4 cameras and 7 annotated subjects. The preprocessed Human3.6M dataset consists of 3D ground truth joint keypoints ordered by the camera used during recording, camera parameters, actions conducted by each subject, and subject names.
Figure 1. Visualization generated by Martinez et al. [3]. The left column corresponds to 2D joint coordinates, the middle to ground truth 3D joint coordinates, and the right to predicted 3D joint coordinates from 2D heatmaps.
As with the temporal convolutional and linear baseline models, we use Subjects 1, 5, 6, 7, and 8 for training, and reserve Subjects 9 and 11 for evaluation. We have access to the pre-processed dataset and have contacted the curators to gain access to the original video frames; however, we have yet to hear a response.
We have gained full access to the HumanEva-I dataset [6], which contains 3D ground truth keypoints per frame, representing 15 joints (pelvis, thorax, shoulders, elbows, wrists, hips, knees, ankles, head), along with the original video frames. There are a total of seven video sequences (four grayscale and three color) of four annotated subjects performing five different actions (Walking, Jogging, Throwing/Catching, Gesturing, and Boxing).

We generate our training, validation, and test data using a modified version of the pre-processing step used in [5]. Specifically, this pre-processing step involves calculating the projection from 3D ground truth joint keypoints to 2D joint keypoints using the provided motion capture and camera calibration data (a minimal sketch of this projection appears at the end of this section). Additionally, since the original video frames occasionally contain chunks of invalid joint keypoint measurements, we simply discard these frames.

For the remaining video frames, we only consider the three color cameras, the first three subjects, and the first video sequence take of each action. This leaves us with a total of 28,731 data entries, partitioned into a roughly 50/50 training and validation split.
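As a rough illustration of the projection step above, the following sketch maps 3D ground-truth keypoints into a camera's image plane under a simple pinhole model. It is a minimal sketch under stated assumptions: the parameter names (R, t, f, c) are illustrative rather than the dataset's exact schema, and HumanEva's calibration files also include distortion terms, which this sketch ignores.

```python
import numpy as np

def project_to_2d(joints_3d, R, t, f, c):
    """Pinhole-camera sketch of the 3D-to-2D projection used in our
    pre-processing (adapted from the idea in [5]).
      joints_3d: (N, 3) world-frame joint positions from motion capture
      R: (3, 3) camera rotation, t: (3,) camera center (extrinsics)
      f: (2,) focal lengths, c: (2,) principal point (intrinsics)
    All names here are illustrative assumptions; lens distortion is ignored."""
    cam = (joints_3d - t) @ R.T        # world -> camera coordinates
    xy = cam[:, :2] / cam[:, 2:]       # perspective divide by depth
    return xy * f + c                  # apply intrinsics -> pixel coordinates
```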
4. Method
In order to generate "ground truth" occlusion labels for 2D poses, we start with a simple heuristic called "Clustered Occlusions," similar to the method of [1]. For every frame, we have 17 joint keypoints in the form of 3D camera coordinates $k_i = (x_i, y_i, z_i)$ for the Human3.6M dataset, or 15 joint keypoints for the HumanEva-I dataset. Because most occluded joints in these frames are blocked by other body parts or joints, our intuition is to find joints that are clustered together in the $xy$-plane, mark the joint closest to the camera (smallest depth) as not occluded, and mark the other joints in the cluster as occluded. In other words, for each joint coordinate $k_i$, we find the set of keypoints $k_j$ where
$$\sqrt{(x_i - x_j)^2 + (y_i - y_j)^2} < \epsilon,$$
and $\epsilon$ is a tunable tolerance parameter; we currently use a small fixed $\epsilon$. Then, we add $k_i$ and all such keypoints $k_j$ to form a cluster $S$, and our non-occluded keypoint index $n$ from this set $S$ is
$$n = \arg\min_{k_p \in S} z_p.$$
We mark all other points in the set $S$ as occluded. We hope that this heuristic can generally fetch all joints that are observable by the camera, as we believe joints in close proximity occlude one another. This gives us a vector of occluded joints $o$ for each frame, where 1 means occluded and 0 means not occluded (a minimal implementation sketch follows below).

After applying the heuristic to get occluded joints, we generate ground truth 2D heatmaps for each existing 2D pose by placing a white circle with Gaussian smoothness at the image coordinates of each joint that is not occluded, and doing nothing for joints that are occluded. Then, we take a center crop of the heatmap around the subject and resize the width and height to 128. Our ground truth and predicted 2D heatmaps can be visualized in Figures 2 and 3.

Figure 2. Ground truth 2D heatmaps (top) and predicted 2D heatmaps (bottom) for a sequence of poses. The ground truth heatmaps have fewer peaks (occluded keypoints).

This is only a simple heuristic, and we hope to test how adding the ground truth occlusion labels affects our error. To improve our heuristic, we take inspiration from the cylinder man model delineated in [1], described next.
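Before turning to the cylinder man model, the following is a minimal sketch of the clustered-occlusion heuristic referenced above, assuming keypoints arrive as an (N, 3) array of camera-frame coordinates; the default eps value is a placeholder, not our tuned tolerance.

```python
import numpy as np

def clustered_occlusions(keypoints, eps=0.1):
    """"Clustered Occlusions" heuristic: joints whose projections lie
    within eps of each other in the xy-plane form a cluster S; only the
    cluster member closest to the camera (smallest depth z) stays visible.
    keypoints: (N, 3) array of camera-frame coordinates (x, y, z).
    Returns o: (N,) array with 1 = occluded, 0 = not occluded."""
    occluded = np.zeros(len(keypoints), dtype=int)
    for i, (xi, yi, _) in enumerate(keypoints):
        # cluster S: all keypoints within eps of joint i in the xy-plane
        dists = np.hypot(keypoints[:, 0] - xi, keypoints[:, 1] - yi)
        cluster = np.where(dists < eps)[0]          # always contains i itself
        if len(cluster) > 1:
            nearest = cluster[np.argmin(keypoints[cluster, 2])]
            occluded[cluster[cluster != nearest]] = 1
    return occluded
```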
Figure 3. Up close example of the occluded joints omitted in the ground truth heatmap (top) compared to the predicted heatmap (bottom).

The cylinder man model generates occlusion labels for 2D poses using 3D poses (Figure 4). Specifically, it models the head, body, arms, and legs as cylindrical segments. In general, if a certain joint is located within another joint's cylindrical segment in 3D space, it is deemed occluded in the 2D space. More specifically, this technique calculates a visibility variable to determine the degree of occlusion for each joint.

We take this idea and adapt it to our computing needs, introducing an occlusion technique that requires less computational power and memory than the original cylinder man model. We propose a boxed man model that generates occlusion labels for 2D poses using the original 2D poses. We visualize the original cylinder man, with equivalent proportions, squashed into 2D space.

For example, in the case of the Human3.6M dataset, keypoints 9 and 10 represent the top and bottom of the subject's head. Define these keypoints to be $A$ and $B$, as seen in Figure 5. We use these two keypoints and project them to four points $A_1$, $A_2$, $B_1$, $B_2$, which determine the bounds of our boxed approximation of the head. We determine the four points by first calculating the slope of the line $AB$, which we define as $m_{AB}$, to then find the perpendicular slope $m'_{AB} = -1/m_{AB}$. We then define
$$A_1 = \left( A_x - \cos(\arctan(m'_{AB}))\,\delta,\; A_y - \sin(\arctan(m'_{AB}))\,\delta \right)$$
$$A_2 = \left( A_x + \cos(\arctan(m'_{AB}))\,\delta,\; A_y + \sin(\arctan(m'_{AB}))\,\delta \right)$$
where $\delta$ is a hyperparameter that determines the width of the boxes. $B_1$ and $B_2$ are defined similarly, instead using $B$'s coordinates $B_x$ and $B_y$. The box that covers the chest and torso area is defined by four points already provided in the keypoints, specifically 1, 4, 11, and 14 in the Human3.6M dataset.

If a joint in the boxed man model is located within another joint's boxed segment in 2D space, we deem it occluded (a sketch of this construction follows the figure captions below). This simpler heuristic encourages our temporal convolutional network to learn poses based on joints which are definitively not occluded. We believe that the temporal convolutional network will be able to learn poses for occluded joints from other camera positions in the dataset, and should learn as much as possible from the original data rather than be forced to learn from certain joints by a heuristic.

Figure 4. A visualization of the cylinder man model [1]. Each arm and leg is scaled to a diameter of 5 cm, and the head is scaled to a diameter of 10 cm.

Figure 5. The setup for the boxed man model. Note that $\hat{i}$ and $\hat{j}$ represent unit vectors for the $x$ and $y$ axes.
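As referenced above, here is a rough sketch of the boxed man construction: it computes the four corners $A_1$, $A_2$, $B_1$, $B_2$ for a segment between two 2D keypoints and tests whether another joint falls inside the resulting box. The corner ordering and the handling of degenerate segments are our own illustrative choices, not part of the model definition.

```python
import numpy as np

def segment_box(A, B, delta):
    """Corners (A1, B1, B2, A2, in convex order) of the boxed approximation
    around segment AB, following the slope construction above.
    A, B: (x, y) pairs; delta sets the box width.
    Assumes AB is neither perfectly vertical nor perfectly horizontal."""
    m_ab = (B[1] - A[1]) / (B[0] - A[0])   # slope of line AB
    theta = np.arctan(-1.0 / m_ab)         # angle of the perpendicular
    dx, dy = np.cos(theta) * delta, np.sin(theta) * delta
    A1, A2 = (A[0] - dx, A[1] - dy), (A[0] + dx, A[1] + dy)
    B1, B2 = (B[0] - dx, B[1] - dy), (B[0] + dx, B[1] + dy)
    return [A1, B1, B2, A2]

def point_in_box(p, corners):
    """True if 2D point p lies inside the convex quadrilateral 'corners':
    cross products of consecutive edges with p must all share a sign."""
    signs = []
    for i in range(4):
        a = np.asarray(corners[i])
        b = np.asarray(corners[(i + 1) % 4])
        cross = (b[0] - a[0]) * (p[1] - a[1]) - (b[1] - a[1]) * (p[0] - a[0])
        signs.append(cross >= 0)
    return all(signs) or not any(signs)
```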
Our main model that produces 3D poses is the 3D temporal convolutional network (TCN) adapted from Pavllo et al. [5]. Taking a consecutive sequence of 2D joint keypoints, it uses temporal convolutions and residual connections to predict a frame's 3D joint coordinates. The input layer applies a temporal convolution over the 2D keypoints with kernel size $W = 3$, expanding the number of channels from two times the number of joints (one for each $x$ and $y$ coordinate) to $C = 1024$. The model then goes through $B$ ResNet-style blocks, which are connected through skip layers. Each block has a 1D convolution with kernel size $W$ and $C$ channels, followed by another convolution with kernel size 1. All convolutional layers are followed by batch normalization, ReLU activation, and dropout. Finally, the last layer shrinks the number of channels and outputs the predicted 3D pose keypoints.

Because we are focusing on occlusion, we also input a sequence of predicted occluded joints into our TCN, where every joint is either 0 for not occluded or 1 for occluded. First, we apply a temporal convolution to the occlusion vectors. Then, we apply a sigmoid activation and zero out keypoints whose values in the occlusion vectors are above some threshold. By doing this, we are effectively trying to learn the ground truth occlusion vectors over the temporal convolution, hoping that we can successfully zero out the actually occluded keypoints before applying convolutions over the 2D keypoint sequence.

We explore two variants of our model with this input. Our first variant uses down-convolutions to drop from a temporal range of vectors to one occlusion vector for the frame of the 3D pose we are predicting. This way, the output occlusion vector can be directly compared to that frame's ground truth occlusion vector in the loss. Instead of focusing on learning only one occlusion vector, our second variant has no down-convolutions and tries to learn the whole sequence of ground truth occlusion vectors.

With the ground truth occlusion labels, we can now train our TCN to notice the occluded joints. To do so, we add a term to our existing loss function $L$, the mean per-joint position error (MPJPE) between estimated and ground truth 3D poses:
$$L = \frac{1}{MN} \sum_{k=1}^{M} \sum_{i=1}^{N} \left\| \left( J_i^{(k)} - J_{\mathrm{root}}^{(k)} \right) - \left( \hat{J}_i^{(k)} - \hat{J}_{\mathrm{root}}^{(k)} \right) \right\|,$$
where $M$ represents the number of examples and $N$ is the number of joints. Now, letting the ground truth occlusion vector for a frame be $o$ and the predicted occlusions be $\tilde{o}$, our modified loss function $L'$ is
$$L' = \lambda_1 L + \frac{\lambda_2}{MN} \sum_{k=1}^{M} \sum_{i=1}^{N} \left| o_i^{(k)} - \tilde{o}_i^{(k)} \right|,$$
where $\lambda_1$ and $\lambda_2$ are tunable weights. A sketch of the occlusion gating and this loss follows.
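The following is a minimal PyTorch sketch of the occlusion branch and the modified loss $L'$ described above. It is illustrative rather than our exact implementation: the layer sizes, threshold, and $\lambda$ values are placeholder assumptions.

```python
import torch
import torch.nn as nn

class OcclusionGate(nn.Module):
    """Occlusion branch sketch: a temporal convolution over the 0/1
    occlusion vectors followed by a sigmoid; 2D keypoints whose predicted
    occlusion exceeds a threshold are zeroed before the 3D TCN sees them."""
    def __init__(self, num_joints, kernel_size=3, threshold=0.5):
        super().__init__()
        self.conv = nn.Conv1d(num_joints, num_joints, kernel_size,
                              padding=kernel_size // 2)
        self.threshold = threshold

    def forward(self, occ, kpts_2d):
        # occ: (batch, joints, frames); kpts_2d: (batch, joints, 2, frames)
        occ_pred = torch.sigmoid(self.conv(occ))
        keep = (occ_pred < self.threshold).float()   # 1 for likely-visible joints
        return occ_pred, kpts_2d * keep.unsqueeze(2)

def mpjpe(pred, target):
    """Loss L: mean per-joint position error over root-aligned poses.
    pred, target: (M, N, 3); joint 0 is assumed to be the root (pelvis)."""
    pred = pred - pred[:, :1]        # root-align predictions
    target = target - target[:, :1]  # root-align ground truth
    return torch.norm(pred - target, dim=-1).mean()

def occlusion_aware_loss(pred_3d, gt_3d, occ_pred, occ_gt, lam1=1.0, lam2=0.1):
    """Modified loss L' = lam1 * L + lam2 * mean |o - o_tilde|.
    lam1 and lam2 are the tunable weights (values here are placeholders)."""
    return lam1 * mpjpe(pred_3d, gt_3d) + lam2 * torch.abs(occ_gt - occ_pred).mean()
```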
5. Experiments
Pavllo et al. [5] implemented 3D pose estimation in videos using temporal convolutions and semi-supervised training. We use this model as a baseline for our 3D pose estimation, as it does not actively seek to solve the problem of occlusion.

Another baseline whose work we seek to build upon and compare against is [1]. Because their network is occlusion-aware, we hope to achieve similar results or possibly improve on the problems that they face.
We evaluate our models using the original mean per-joint position error (MPJPE) between estimated and ground truth 3D poses in millimeters (mm). The mean is calculated over all $N$ joints used in each image frame; in this case, $N = 17$ for the Human3.6M dataset and $N = 15$ for the HumanEva-I dataset. We first align the pelvis as the root joint before comparing differences in Euclidean distance between the poses. The joints are also normalized with respect to the root joint.

We trained and evaluated our model on HumanEva, looking at Subjects 1, 2, and 3 and the actions Walk, Jog, and Box, as these are the subjects and actions focused on in [5]. We seek to compare mostly against [5] because we use the same network, but we also compare our results to [1]. Results are shown in Tables 1 through 5. Tables 1 through 3 show the results for different methods over actions, Table 4 shows Cheng et al.'s results, and Table 5 shows the averages over all actions and subjects for each method.

Walk    Subject 1   Subject 2   Subject 3   Average
1       13.9        10.2        46.6        23.6
2       14.4        10.2        46.8        23.8
3       14.1
Table 1. MPJPE of the action Walking for HumanEva (mm). (1: Pavllo et al. [5]; 2: One vector, Clustered; 3: Many vectors, Clustered; 4: Many vectors, Boxed Man). Our multiple occlusion vector method coupled with the boxed man model keypoints achieves the best average MPJPE across subjects. Bolded numbers are the best among the methods.
Jog     Subject 1   Subject 2   Subject 3   Average
1       20.9        13.1        13.8        15.9
Table 2. MPJPE of the action Jogging for HumanEva (mm). (1: Pavllo et al. [5]; 2: One vector, Clustered; 3: Many vectors, Clustered; 4: Many vectors, Boxed Man). Our one vector method coupled with the simple clustered heuristic achieves the best average MPJPE across subjects. Bolded numbers are the best among the methods.
Box     Subject 1   Subject 2   Subject 3   Average
1       23.8        33.7        32.0        29.8
Table 3. MPJPE of the action Boxing for HumanEva (mm). (1: Pavllo et al. [5]; 2: One vector, Clustered; 3: Many vectors, Clustered; 4: Many vectors, Boxed Man). Our one vector method coupled with the simple clustered heuristic achieves the best average MPJPE across subjects. Bolded numbers are the best among the methods.
Action   Subject 1   Subject 2   Subject 3   Average
Walk     11.7        10.1        22.8        14.9
Jog      18.7        11.4        11.0        13.7
Table 4. MPJPE for HumanEva by Cheng et al. [1] (mm). They achieve strong performance across the board, most likely because they extend their method end-to-end and use heatmaps to be occlusion-aware. We do not have the compute power that they do to add heatmaps to our model.
Method                     Average
Pavllo et al. [5]          23.11
One vector, Clustered      23.06
Many vectors, Clustered    23.06
Many vectors, Boxed Man    23.01
Table 5. Average MPJPE for HumanEva over all subjects and actions considered (mm). Our many vectors variant with the boxed man model occlusion keypoints works the best, beating the baseline and our other methods.

From these results, our variant of the temporal convolutional model that compares a sequence of occlusion vectors to the sequence of ground truth vectors, coupled with our boxed man model, works the best, achieving an average MPJPE of 23.01 mm over all subjects and actions. Our other variant, which uses only the single-frame occlusion vector in the loss, also performed better than our initial baseline, showing that the clustered occlusion heuristic likewise worked well to prevent occluded joints from being wildly mispredicted.
After one epoch and at convergence, respectively, the linear baseline achieved the following results on 3 of the 15 tasks, and on average:

              Directions   Photo   SittingDown   Average
One epoch     60.86        95.53   117.99        77.24
Convergence   39.5         56.0    69.4          47.7
Table 6. MPJPE of the linear baseline (mm).
After convergence, as reported in [1], the occlusion-aware network achieved the following results:

Directions   Photo   SittingDown   Average
38.8         51.9    58.4          42.9
Table 7. MPJPE of the occlusion-aware model (mm).
We initially tried inputting 32-by-32 predicted and ground truth heatmaps into the TCN to combat occlusion, similar to [1]. However, because of our lack of computational power, we were only able to train on a subset of the data for a few epochs. We ended up with a test error of 174.66 mm. We also tried changing from occluded heatmaps to occlusion vectors, and we ended up with a test error of 50.36 mm. Our results on Human3.6M are definitely stunted by the low computational power available to us and the sheer size of the dataset.
For our baseline results, we selected the three tasks Directions, Photo, and SittingDown because we believe they best exhibit the differences in the models' performance. The difference in performance on the Photo action is around the mean of the differences in performance across all actions. While the variation in MPJPE for the Directions task is small between the linear baseline and the occlusion-aware network, it is significantly larger for the SittingDown action. We believe this is due to the occluded nature of the action, as well as the significance of temporal information. Sitting down involves the knee joints occluding the hip joint at the end of the action from a ventral camera orientation. Given that the linear model is trained over single images, it is unable to learn the hip joint's trajectory over the course of multiple frames. On the other hand, the occlusion-aware network is able to use both its heuristic of whether or not the hip joint is occluded and the hip joint's trajectory across a sequence of video frames to accurately predict where the hip joint should be located.
Naturally, due to the boxed man model's anatomically derived occlusion method, it performed better than the baseline Gaussian model. We believe another source of increased performance for the boxed man model is its stronger tendency to mark a given joint as occluded compared to the baseline. Occlusion serves as a form of dropout, or regularization, for the model. Specifically, we believe that feeding the network information about whether or not a joint is occluded eventually teaches the model to rely less on joints that are marked as occluded. Given that different actions cause inherently different occlusion patterns, the model will be less inclined to focus on a few joints during training. The boxed man model's lax requirements on occlusion allow for a wider variation of non-occluded keypoint permutations.
6. Conclusion
Most of our work focuses on adapting a temporal convolutional network to predict occluded human 3D poses from 2D ground truth heatmaps, and indeed, the average mean per-joint position error across our three network variants was comparable to, but nonetheless still lower than, the baseline architecture in [5]. However, since this only constitutes the second half of the video-to-estimated-3D-human-pose pipeline, work remains to be done in improving occluded heatmap generation in the first place.

For example, training a stacked hourglass network using our clustered ground truth occlusions and boxed man models would make our task more consistent end-to-end. We initially began experimenting with such methodologies and re-configured an existing stacked hourglass implementation [4] (originally configured to work with the MPII human pose dataset) to work with the HumanEva-I dataset. However, due to limited computing resources, we decided to focus on training the temporal convolutional network.

Another similar idea involves adding data augmentation by manually blocking out non-occluded joints. This could simply take the form of dropping out random joints during training of the stacked hourglass network (as in the sketch below), or, since significant pre-processing is already being performed on the videos, it could perhaps even involve editing the frames themselves.
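As a concrete illustration of the joint-dropout idea above, here is a minimal sketch of how random visible joints could be blocked out during training; the dropout probability is a placeholder assumption, not a tuned value.

```python
import numpy as np

def drop_random_joints(kpts_2d, occ, p=0.1, rng=np.random):
    """Data augmentation sketch: randomly mark non-occluded joints as
    occluded and zero out their 2D coordinates.
    kpts_2d: (N, 2) joint coordinates; occ: (N,) with 1 = occluded.
    p is a placeholder dropout probability."""
    kpts_2d, occ = kpts_2d.copy(), occ.copy()
    visible = np.where(occ == 0)[0]
    dropped = visible[rng.random(len(visible)) < p]  # sample joints to block
    occ[dropped] = 1
    kpts_2d[dropped] = 0.0
    return kpts_2d, occ
```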
References

[1] Y. Cheng, B. Yang, B. Wang, Y. Wending, and R. Tan. Occlusion-aware networks for 3D human pose estimation in video. In ICCV, pages 723-732, 2019.
[2] C. Ionescu, D. Papava, V. Olaru, and C. Sminchisescu. Human3.6M: Large scale datasets and predictive methods for 3D human sensing in natural environments. IEEE Transactions on Pattern Analysis and Machine Intelligence, 36(7):1325-1339, July 2014.
[3] J. Martinez, R. Hossain, J. Romero, and J. J. Little. A simple yet effective baseline for 3D human pose estimation. In ICCV, 2017.
[4] A. Newell, K. Yang, and J. Deng. Stacked hourglass networks for human pose estimation. In ECCV, pages 483-499. Springer, 2016.
[5] D. Pavllo, C. Feichtenhofer, D. Grangier, and M. Auli. 3D human pose estimation in video with temporal convolutions and semi-supervised training. In CVPR, 2019.
[6] L. Sigal, A. Balan, and M. J. Black. HumanEva: Synchronized video and motion capture dataset and baseline algorithm for evaluation of articulated human motion. International Journal of Computer Vision, 87(1):4-27, March 2010.
[7] A. Toshev and C. Szegedy. DeepPose: Human pose estimation via deep neural networks. In CVPR, pages 1653-1660, 2014.
[8] J. Walker, K. Marino, A. Gupta, and M. Hebert. The pose knows: Video forecasting by generating pose futures. In ICCV, 2017.