Human Pose Estimation on Privacy-Preserving Low-Resolution Depth Images
Vinkle Srivastav, Afshin Gangi, and Nicolas Padoy
ICube, University of Strasbourg, CNRS, IHU Strasbourg, France
Radiology Department, University Hospital of Strasbourg, France
{srivastav|padoy}@unistra.fr
Abstract.
Human pose estimation (HPE) is a key building block for developing AI-based context-aware systems inside the operating room (OR). The 24/7 use of images coming from cameras mounted on the OR ceiling can however raise concerns for privacy, even in the case of depth images captured by RGB-D sensors. Being able to solely use low-resolution privacy-preserving images would address these concerns and help scale up the computer-assisted approaches that rely on such data to a larger number of ORs. In this paper, we introduce the problem of HPE on low-resolution depth images and propose an end-to-end solution that integrates a multi-scale super-resolution network with a 2D human pose estimation network. By exploiting intermediate feature maps generated at different super-resolution scales, our approach achieves body pose results on low-resolution images (of size 64x48) that are on par with those of an approach trained and tested on full-resolution images (of size 640x480).
Keywords:
Human Pose Estimation · Privacy Preservation · Depth Images · Low-resolution Data · Operating Room
1 Introduction

Modern hospitals could highly benefit from the use of smart assistance systems that are able to support the workflow by exploiting digital data from equipment and sensors through artificial intelligence and surgical data science [11,12]. This is illustrated by the recent development of new applications, such as patient activity monitoring inside intensive care units (ICU) [10], staff hand-hygiene recognition [5], radiation exposure monitoring during hybrid surgery [13] and workflow step recognition in the operating room (OR) [18]. These systems, which have a huge potential to improve safety and care, all rely on machine intelligence using computer vision models to extract semantic information from visual data. In particular, human detection and pose estimation in the operating room [1,6,8] is one of the key components to develop such applications.

† Published at the International Conference on Medical Image Computing and Computer-Assisted Intervention (MICCAI) 2019.
(a) 640x480 (1x)  (b) 80x60 (8x)  (c) 64x48 (10x)
Fig. 1: Depth and color images from the MVOR dataset [17] down-sampled at different resolutions using bicubic interpolation (resized for better visualization). Low-resolution depth images contain little information for the identification of patients and health professionals. The corresponding color images in the second row are shown for better appreciation of the downsampling process.

Constant monitoring by the use of cameras however raises potential concerns for the privacy of patients and health professionals. Cameras usually capture color images, depth images, or both for visual processing. Color images appear to be the most privacy-intrusive, but even textureless depth images can intrude on privacy when used at sufficiently high resolution [3,4]. This is particularly relevant in environments where the number of persons is limited and where the persons could potentially be more easily identified. Figure 1 shows depth images at different resolutions. It suggests that low-resolution images could be used for more privacy-compliant computer-vision applications and that their recording could be better accepted by clinical institutions. In [4], it has been shown that activity recognition can be performed on low-resolution depth images captured for the tasks of hand-hygiene classification and ICU activity logging. In this work, we investigate whether low-resolution depth images contain sufficient information for accurate human pose estimation (HPE).

HPE consists of localizing human keypoints in images. Methods for human pose estimation differ for color and depth images, both in terms of model architectures and complexity of training datasets. In the case of color images, deep learning models have recently shown remarkable progress with the help of large-scale in-the-wild annotated datasets, such as COCO [9]. Deep learning models for HPE can generally be grouped into bottom-up and top-down approaches. Bottom-up approaches first detect the keypoints and then group them to form skeletons [2], whereas top-down approaches first detect the persons using person detectors and then use a single-person pose estimator to estimate body joints in each detected box [19]. Top-down approaches are often more accurate due to their two-stage design, but slower in comparison to bottom-up approaches. For depth images, Shotton et al. [16] use a high-resolution synthetic depth dataset to train their models, while Haque et al. [6] focus on single-person pose estimation using datasets recording actors performing simulated actions. Recently, Srivastav et al. [17] have introduced the MVOR dataset, which contains color and depth images captured in the OR along with ground-truth human poses, and have evaluated recent HPE methods on it. This is therefore an interesting testbed for multi-person pose estimation on depth data captured during real surgical activities, which we will use in this work.

Current methods for HPE inside the OR have been developed using standard-resolution images [1,8]. We have found that state-of-the-art models, which are trained on high-resolution images, perform poorly on the corresponding low-resolution images. In this paper, we therefore propose an approach for the human pose estimation problem on low-resolution depth images. To the best of our knowledge, this is the first work that attempts to solve this task.

To train our system, we use a non-annotated dataset of synchronized RGB-D images captured in the OR environment.
Unlike conventional approaches, which use either manual or synthetically rendered annotations that are challenging to generate, we propose to use the detections from a state-of-the-art method applied to the color images as pseudo ground truth for the corresponding depth images. This simple idea turns out to be very effective. Indeed, as our approach only requires a set of RGB-D images at train time, it can be easily retrained in any facility, since no annotation process is needed. Then, it can run round the clock on low-resolution depth images from the same facility. Our HPE approach is a network which integrates super-resolution modules with a 2D multi-person body keypoint estimator based on RTPose [2]. It utilizes intermediate super-resolution feature maps to better learn the high-frequency features. With the proposed architecture, we achieve the same results as a network trained on standard-resolution images and improve by 6.5% over a baseline method which up-samples the low-resolution images with bicubic interpolation before feeding them to the pose estimation network.
2 Methods

2.1 Architecture

Our approach is inspired by the recent developments in the areas of super-resolution and multi-person human pose estimation. We propose to integrate a super-resolution image estimator and a 2D multi-person pose estimator in a joint architecture, illustrated in Figure 2. This architecture is based on a modification of the RTPose network [2]. Besides yielding competitive results on COCO and MVOR, RTPose has the advantage of performing multi-person pose estimation in a single step, thereby simplifying the integration and training of the super-resolution modules. It is composed of a feature extraction block and a pose estimation block, shown in Figure 2.
[Figure 2 diagram. Blocks: Super Resolution Block, SR Features Block (outputs F, S1, S2), Feature Extraction Block, and Pose Estimation Block with a Keypoint Localization branch and a Part Affinity branch repeated over 3 stages. Losses: Loss(B*) and Loss(C*) at each stage, and Loss(HR) on the super-resolved image. Legend: feature concatenation; 2x2 upsampling using pixel-shuffle. Layer groups C1 to C16 are listed below.]
C1: c3(1,64)->c3(64,128)->c3(128,256)
C2: c3(64,64)->c3(64,128)->c3(128,256)
C3: c3(64,64)->c3(64,128)->c3(128,256)->c3(256,256)
C4: c3(64,64)->c3(64,64)
C5: c3(64,64)->c3(64,128)
C6: c3(128,256)->c3(256,256)->c3(256,256)->c3(256,256)
C7: c3(256,512)->c3(512,512)->c3(512,256)->c3(256,128)
C8: c3(64,32)
C9: c3(32,32)
C10: c3(128,128)->c3(128,128)->c3(128,128)->c1(128,512)
C11: c1(512,38)
C12: c1(512,19)
C13: c7(249,128)->c7(128,128)->c7(128,128)->c7(128,128)->c7(128,128)->c1(128,128)
C14: c1(128,38)
C15: c1(128,19)
C16: c3(64,1): 2x2 max-pooling
Fig. 2: Proposed architecture. The super-resolution block increases the spatial resolution by a factor of 8x and generates intermediate SR feature maps (S1, S2) used by the pose estimation block to learn high-frequency features. All losses are mean square error losses. C1 to C16 are convolution layers grouped together for better visualization and described below the figure, where c1(n1,n2), c3(n1,n2), c7(n1,n2) each represent a convolution layer with kernel size 1x1, 3x3, 7x7 and padding 0, 1, 3, respectively. Parameters n1 and n2 are the numbers of input and output channels, and all convolution layers are followed by a ReLU non-linearity.

We introduce a super-resolution block, which not only increases the spatial resolution but also generates super-resolution (SR) feature maps (S1, S2). These intermediate feature maps contain high-frequency details, which are lost during the low-resolution (LR) image generation process, and are used in the pose estimation block for better localization. The super-resolution block uses a multi-stage design, where each stage increases the spatial resolution of the feature maps by a factor of two using the pixel-shuffle algorithm [15] (while reducing the number of channels by four). During training, a complete SR image is generated to compute the auxiliary loss $L_{HR}$, which compares the SR image to the ground-truth high-resolution (HR) depth image using the L2 norm. This helps to train the super-resolution block and refines the input to the SR features block. Note that during training, errors from the pose estimation are also back-propagated to these blocks. Furthermore, at test time only LR images are used and no SR images need to be generated by the network, since only the SR feature maps are used.

RTPose was originally developed for color images. Since depth images contain fewer texture details, we have made the architecture more computationally efficient by reducing the number of iterative refinement stages from five to three.
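To make the pixel-shuffle upsampling concrete, here is a minimal PyTorch sketch of one super-resolution stage; the module name, channel sizes and chaining are illustrative assumptions on our side, not the authors' exact layer configuration (which follows the C1 to C16 groups above):

```python
import torch
import torch.nn as nn

def c3(n1, n2):
    # 3x3 convolution with padding 1 followed by ReLU (figure notation c3(n1,n2))
    return nn.Sequential(nn.Conv2d(n1, n2, kernel_size=3, padding=1),
                         nn.ReLU(inplace=True))

class SRStage(nn.Module):
    """One stage of the super-resolution block (illustrative): a refinement
    convolution followed by pixel-shuffle, which doubles the spatial
    resolution while dividing the number of channels by four."""
    def __init__(self, channels):
        super().__init__()
        self.refine = c3(channels, channels)
        self.shuffle = nn.PixelShuffle(2)  # (B, C, H, W) -> (B, C/4, 2H, 2W)

    def forward(self, x):
        return self.shuffle(self.refine(x))

# Chaining three stages yields the paper's 8x upsampling; the intermediate
# outputs play the role of the SR feature maps S1 and S2.
x = torch.randn(1, 64, 48, 64)   # feature map at 64x48 LR resolution
s1 = SRStage(64)(x)              # 2x: (1, 16,  96, 128)
s2 = SRStage(16)(s1)             # 4x: (1,  4, 192, 256)
hr = SRStage(4)(s2)              # 8x: (1,  1, 384, 512), a 1-channel SR image
```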
The network uses two separate branches, one for keypoint localization and another to compute part affinity maps [2]. In our architecture, these two branches consume three types of features (F, S1, S2), where F are the features extracted from the high-resolution feature maps provided by the super-resolution block. The final skeleton is generated from the part affinity and keypoint localization heatmaps using the bipartite graph matching algorithm presented in [2]. Losses in the pose estimation network are used as in [2], but now take the input from the SR feature maps (S1, S2). At each stage $t$, two L2 losses $L_{B_t}$ and $L_{C_t}$ are computed from the predicted part affinity/keypoint localization heatmaps ($B_t$/$C_t$) and the ground-truth heatmaps ($B^*$/$C^*$). All the $L_{B_t}$ and $L_{C_t}$ losses are summed together to form the pose estimation loss $L_P$. Finally, the total loss is the sum of $L_{HR}$ and $L_P$. We have chosen to weigh both terms equally, as we observe that their magnitudes are similar. The complete network is trained end-to-end jointly for both super-resolution and pose estimation.
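Putting the pieces together, the training objective described above can be written as (our transcription of the text, with $T = 3$ stages per the reduced architecture):

```latex
L_P = \sum_{t=1}^{T} \left( L_{B_t} + L_{C_t} \right), \qquad
L_{\mathrm{total}} = L_P + L_{HR}, \qquad T = 3
```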
2.2 Pseudo Ground Truth Generation

In the literature, authors have used either manually annotated or synthetically generated datasets to train for HPE on depth images. Manual annotations can be expensive and time-consuming, and synthetic annotations are difficult to generate due to the constraint of realistic rendering and do not always generalize well to real scenarios. Therefore, we use an alternative approach to generate annotations. This approach is based on the observation that RGB-D cameras capture synchronized color and depth streams, and that recent HPE methods trained on the COCO dataset [9] work remarkably well on the color images. Therefore, we use detections from the color images to train the model for the depth images. To facilitate this approach, we collected an unlabeled RGB-D dataset containing 80k synchronized color and depth images captured in the OR during real surgical procedures. Then, we used the state-of-the-art person detector Mask-RCNN [7] and a single-person pose estimator MSRA [19] on the color images to generate detections. We filter out the false positives and retain high-quality detections at both stages using thresholds selected from the qualitative results on a small set of images. This approach generates pseudo ground truth automatically, without any human annotation effort. It is therefore scalable and can be deployed to any facility. For human pose estimation, we choose here a two-step method based on Mask-RCNN and MSRA for their state-of-the-art performance on the public COCO dataset. Note that such a two-step method would be less convenient to use in our approach, due to the large architectures involved and the fact that super-resolution would need to be integrated into both.
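As an illustration of this two-stage filtering, a Python sketch follows; the detection format and function name are hypothetical, while the thresholds are those reported in the training setup below:

```python
def filter_pseudo_ground_truth(detections, box_thresh=0.7,
                               kp_thresh=0.35, min_keypoints=4):
    """Keep only high-quality color-image detections as pseudo ground
    truth for the synchronized depth images (hypothetical format: each
    detection has a person-box 'score' and 'keypoints' given as
    (x, y, score) triplets)."""
    kept = []
    for det in detections:
        if det["score"] < box_thresh:        # person-detector stage
            continue
        confident = [kp for kp in det["keypoints"] if kp[2] > kp_thresh]
        if len(confident) >= min_keypoints:  # pose-estimator stage
            kept.append(det)
    return kept
```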
3 Experiments and Results

Training setup:
We use the dataset of 80k images and the pseudo ground truth described in Section 2.2 for training. It contains 20k images in each of four categories, which include images with one, two, three, and four or more persons, respectively. We split the dataset into 77k training and 3k validation images. When downsampling the images to sizes 80x60 and 64x48, we use bicubic interpolation. To generate the pseudo ground truth, we use a threshold of 0.7 in the person-detector stage and then select a skeleton if at least 4 keypoints are detected with a score greater than 0.35. We use the PyTorch deep learning framework in our experiments. The depth images are normalized in the range [0, 255], and we train our networks using the stochastic gradient descent optimizer with a momentum of 0.9. The initial learning rate is set to 0.001 with a step decay of 0.1 after 12k iterations, and each model is trained for 32k iterations with a batch size of 12. We use the pre-trained weights from the authors of RTPose to initialize the pose-estimator networks. Note that these weights were originally obtained using the color images from the COCO dataset. For the layers that have been modified in the pose-estimation network and contain a larger number of channels (e.g. to accommodate S1 and S2), we repeated the same weights and perturbed them by a small random number. The weights of the super-resolution network are initialized using orthogonal initialization [14].
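The hyper-parameters above correspond roughly to the following PyTorch setup (a sketch: the dummy stand-in model, the bicubic downsampling call and the one-time decay via MultiStepLR are our assumptions):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

model = nn.Conv2d(1, 1, 3, padding=1)  # stand-in for the joint SR + pose network

# SGD with momentum 0.9, initial LR 0.001, decayed by 0.1 after 12k iterations
optimizer = torch.optim.SGD(model.parameters(), lr=0.001, momentum=0.9)
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer,
                                                 milestones=[12000], gamma=0.1)

for iteration in range(32000):                 # 32k iterations, batch size 12
    hr = torch.rand(12, 1, 480, 640) * 255     # depth images normalized to [0, 255]
    lr = F.interpolate(hr, size=(48, 64),      # bicubic downsampling to 64x48
                       mode='bicubic', align_corners=False)
    loss = model(lr).pow(2).mean()             # placeholder for L_P + L_HR
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    scheduler.step()                           # iteration-based schedule
```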
Testing setup:
We evaluate our method on the publicly available depth images of the MVOR dataset [17], which contains images of size 640x480 captured in an OR from three different viewpoints during actual clinical interventions. The training dataset comes from the same environment and camera setup, but contains data captured on different days. During testing, we use the flip test, i.e., we average the original heatmaps with the heatmaps obtained after flipping the images horizontally, to refine the predictions. We use the percentage of correct keypoints (PCK) [20] as the evaluation metric, which is widely used to measure the localization accuracy of the detected skeletons in multi-person scenarios.
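A minimal sketch of the flip test on keypoint heatmaps; the paper only states that original and flipped heatmaps are averaged, so the left/right channel swap below is our assumption of the standard procedure:

```python
import torch

def flip_test(model, image, flip_pairs):
    """Average heatmaps predicted on the original and the horizontally
    flipped image. flip_pairs lists (left, right) keypoint-channel index
    pairs to swap back after flipping (e.g., left/right shoulder)."""
    heatmaps = model(image)                        # (B, K, H, W)
    flipped = model(torch.flip(image, dims=[3]))   # flip along the width axis
    flipped = torch.flip(flipped, dims=[3])        # un-flip the heatmaps
    for left, right in flip_pairs:                 # restore joint symmetry
        flipped[:, [left, right]] = flipped[:, [right, left]]
    return (heatmaps + flipped) / 2
```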
Results:
We show our results in Table 1. RTPose 640x480, RTPose 80x60, and RTPose 64x48 are baseline RTPose models that do not use any super-resolution and are trained on 640x480 (full-size), 80x60, and 64x48 depth images, respectively. These RTPose variants are the original models modified to take a 1-channel input. The degraded 80x60 and 64x48 images are resampled to the original size using bicubic interpolation to match the input size of the network. DepthPose 80x60 and DepthPose 64x48 are our proposed networks directly trained on 80x60 and 64x48 low-resolution images. Results show that the DepthPose 64x48 network, which uses 10x downsampled images, performs on par with the baseline trained on full-size images. Accuracy is improved by over 6.5% compared to the baseline RTPose 64x48. DepthPose 80x60 performs even better than RTPose 640x480 (an interesting fact also observed in [4] in the context of activity classification) and is 3.6% better than RTPose 80x60.

Table 1: Results (PCK) of our proposed method (DepthPose) compared to the baselines (RTPose and SR+RTPose) for different image resolutions.

                   Head  Shoulder  Hip   Elbow  Wrist  Average
RTPose 640x480     82.9  82.2      57.0  68.5   42.8   66.7
RTPose 80x60       81.1  80.0      54.7  65.3   37.3   63.7
RTPose 64x48       77.8  76.4      52.9  60.7   32.0   60.0
DepthPose 80x60    84.3  83.8      55.3  69.9   43.3   67.3
DepthPose 64x48    84.1  83.4      54.3  69.0   41.4   66.5
SR+RTPose 80x60    83.5  82.7      54.1  68.1   40.5   65.8
SR+RTPose 64x48    82.5  81.3      51.0  66.3   37.8   63.8

We have also evaluated the quality of the pseudo ground truth by running the Mask-RCNN and MSRA models on the color images from MVOR. The resulting PCK value is 76.2, showing that there still exists a gap of around 9% to be filled between the depth and color images. This may also explain the improved results of the DepthPose 80x60 model, which takes advantage of an improved architecture compared to the full-size RTPose 640x480 model. Figure 3 shows some qualitative results of the DepthPose 64x48 model. Additional qualitative comparisons are available in the supplementary material.
Comparative study without SR feature maps:
We also ran an experiment to better understand the effect of using super-resolution. Instead of feeding the baselines RTPose 80x60 and RTPose 64x48 with images up-sampled by bicubic interpolation, we feed and train these networks with images up-sampled by a separately trained super-resolution network. This super-resolution network corresponds to the super-resolution block trained independently using the loss $L_{HR}$. We observe in Table 1 that this procedure (SR+RTPose) improves the overall accuracy, but yields results inferior to DepthPose by 1.5% and 2.7% for 80x60 and 64x48 images, respectively. This shows that the use of intermediate SR feature maps in the pose estimation network helps to better localize the keypoints. Also, SR+RTPose has the disadvantage of explicitly generating super-resolution images, the privacy compliance of which would need to be considered.
4 Conclusion

In this paper, we present an approach for multi-person 2D pose estimation from low-resolution depth images. Our evaluation on the public MVOR dataset shows that even with a 10x subsampling of the depth images, our method achieves results equivalent to a pose estimator trained and tested on the original-size images. Furthermore, we show that by exploiting high-quality pose detections on the color images of a non-annotated RGB-D dataset, we can generate pseudo ground truth for the depth images and train an effective OR pose estimator. These results suggest the high potential of low-resolution images for scaling up and deploying privacy-preserving AI assistance in hospital environments.
Fig. 3: Qualitative results of the DepthPose 64x48 model on a 64x48 LR depth image with 3 persons. Ground truth is overlaid on the color images for better visualization.
Acknowledgements
This work was supported by French state funds managed by the ANR within the Investissements d'Avenir program under references ANR-16-CE33-0009 (DeepSurg), ANR-11-LABX-0004 (Labex CAMI) and ANR-10-IDEX-0002-02 (IdEx Unistra). The authors would also like to thank the members of the Interventional Radiology Department at University Hospital of Strasbourg for their help in generating the dataset.
References
1. Belagiannis, V., Wang, X., Shitrit, H.B.B., Hashimoto, K., Stauder, R., Aoki, Y., Kranzfelder, M., Schneider, A., Fua, P., Ilic, S., et al.: Parsing human skeletons in an operating room. Machine Vision and Applications (7), 1035–1046 (2016)
2. Cao, Z., Simon, T., Wei, S.E., Sheikh, Y.: Realtime multi-person 2d pose estimation using part affinity fields. In: CVPR. pp. 7291–7299 (2017)
3. Cheng, Z., Shi, T., Cui, W., Dong, Y., Fang, X.: 3d face recognition based on kinect depth data. In: 4th International Conference on Systems and Informatics (ICSAI). pp. 555–559 (2017)
4. Chou, E., Tan, M., Zou, C., Guo, M., Haque, A., Milstein, A., Fei-Fei, L.: Privacy-preserving action recognition for smart hospitals using low-resolution depth images. NeurIPS-MLH (2018)
5. Haque, A., Guo, M., Alahi, A., Yeung, S., Luo, Z., Rege, A., Jopling, J., Downing, L., Beninati, W., Singh, A., et al.: Towards vision-based smart hospitals: A system for tracking and monitoring hand hygiene compliance. In: Proceedings of Machine Learning for Healthcare. vol. 68 (2017)
6. Haque, A., Peng, B., Luo, Z., Alahi, A., Yeung, S., Fei-Fei, L.: Towards viewpoint invariant 3d human pose estimation. In: ECCV. pp. 160–177. Springer (2016)
7. He, K., Gkioxari, G., Dollár, P., Girshick, R.: Mask r-cnn. In: ICCV. pp. 2961–2969 (2017)
8. Kadkhodamohammadi, A., Gangi, A., de Mathelin, M., Padoy, N.: Articulated clinician detection using 3d pictorial structures on rgb-d data. Medical Image Analysis, 215–224 (2017)
9. Lin, T.Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollár, P., Zitnick, C.L.: Microsoft coco: Common objects in context. In: ECCV. pp. 740–755. Springer (2014)
10. Ma, A.J., Rawat, N., Reiter, A., Shrock, C., Zhan, A., Stone, A., Rabiee, A., Griffin, S., Needham, D.M., Saria, S.: Measuring patient mobility in the icu using a novel noninvasive sensor. Critical Care Medicine (4), 630 (2017)
11. Maier-Hein, L., Vedula, S., Speidel, S., Navab, N., Kikinis, R., Park, A., Eisenmann, M., Feussner, H., Forestier, G., Giannarou, S., et al.: Surgical data science: enabling next-generation surgery. Nature Biomedical Engineering, 691–696 (2017)
12. Padoy, N.: Machine and deep learning for workflow recognition during surgery. Minimally Invasive Therapy & Allied Technologies (2), 82–90 (2019)
13. Rodas, N.L., Barrera, F., Padoy, N.: See it with your own eyes: markerless mobile augmented reality for radiation awareness in the hybrid room. IEEE Transactions on Biomedical Engineering (2), 429–440 (2017)
14. Saxe, A.M., McClelland, J.L., Ganguli, S.: Exact solutions to the nonlinear dynamics of learning in deep linear neural networks. arXiv:1312.6120 (2013)
15. Shi, W., Caballero, J., Huszár, F., Totz, J., Aitken, A.P., Bishop, R., Rueckert, D., Wang, Z.: Real-time single image and video super-resolution using an efficient sub-pixel convolutional neural network. In: CVPR. pp. 1874–1883 (2016)
16. Shotton, J., Sharp, T., Kipman, A., Fitzgibbon, A., Finocchio, M., Blake, A., Cook, M., Moore, R.: Real-time human pose recognition in parts from single depth images. Communications of the ACM (1), 116–124 (2013)
17. Srivastav, V., Issenhuth, T., Kadkhodamohammadi, A., de Mathelin, M., Gangi, A., Padoy, N.: Mvor: A multi-view rgb-d operating room dataset for 2d and 3d human pose estimation. In: MICCAI-LABELS Workshop (2018)
18. Twinanda, A.P., Shehata, S., Mutter, D., Marescaux, J., de Mathelin, M., Padoy, N.: Multi-stream deep architecture for surgical phase recognition on multi-view rgbd videos. In: M2CAI, MICCAI Workshop (2016)
19. Xiao, B., Wu, H., Wei, Y.: Simple baselines for human pose estimation and tracking. In: ECCV. pp. 466–481 (2018)
20. Yang, Y., Ramanan, D.: Articulated human detection with flexible mixtures of parts. IEEE Transactions on Pattern Analysis and Machine Intelligence (12), 2878–2890 (2012)

Supplementary Material

The pictures below show additional qualitative results of our proposed models, namely DepthPose 80x60 and DepthPose 64x48, compared to the baseline models RTPose 640x480, RTPose 80x60, and RTPose 64x48. We also show the ground truth (GT) on color images for better appreciation of the qualitative results. These results show that DepthPose 80x60 and DepthPose 64x48 perform better at removing false positives and spurious detections and improve the part localization (see red and green arrows in the figures).
Figs. 4–11: Qualitative comparisons on low-resolution depth images. Each figure shows (a) RTPose 640x480, (b) RTPose 80x60, (c) RTPose 64x48, (d) GT, (e) DepthPose 80x60, and (f) DepthPose 64x48.