A Deep Neural Network Tool for Automatic Segmentation of Human Body Parts in Natural Scenes
Patrick McClure, Gabrielle Reimann, Michal Ramot, Francisco Pereira
Machine Learning Team, National Institute of Mental Health
Section on Cognitive Neuropsychology, National Institute of Mental Health

Abstract
This short article describes a deep neural network trained to perform automatic segmentation of human body parts in natural scenes. More specifically, we trained a Bayesian SegNet with concrete dropout on the Pascal-Parts dataset to predict whether each pixel in a given frame was part of a person's hair, head, ears, eyes, eyebrows, legs, arms, mouth, neck, nose, or torso.
Our deep neural network (DNN) tool was built to segment human body parts (the hair, the head, the left and right ears, the left and right eyes, the left and right eyebrows, the left and right arms, the left and right legs, the mouth, the neck, the nose, and the torso) from images of natural scenes. The goal in building this tool was to enable fast, automatic segmentation of human body parts from video frames, in order to support the analysis of human eye-tracking data collected while participants watch those videos. The code is available at https://github.com/nih-fmrif/MLT_Body_Part_Segmentation.

We trained and tested the DNN using the training and test sets from Pascal-Parts [1, 2], respectively. Only the images containing the "person" object were used, and the part labels were remapped to combine left and right labels for the same body part (see Table 1). During training, the dataset was augmented by randomly varying the horizontal flipping, contrast, saturation, brightness, and hue of training-set images, as implemented in the torchvision transforms package (https://pytorch.org/docs/stable/torchvision/transforms.html).

We used PyTorch [3] and the Pascal-Parts training dataset to train a Bayesian SegNet [4], with concrete dropout [5] at the center and output layers, to perform automatic segmentation of human body parts from natural scenes. The detailed DNN architecture is shown in Table 2.

Corresponding author: Patrick McClure ([email protected])

Table 1: Segmented part names and labels

Pascal Part Name     New Part Name   New Part Label
Hair                 Hair            1
Head                 Head            2
Left Ear             Ear             3
Left Eye             Eye             4
Left Eyebrow         Eyebrow         5
Left Foot            Leg             6
Left Hand            Arm             7
Left Lower Arm       Arm             7
Left Lower Leg       Leg             6
Left Upper Arm       Arm             7
Left Upper Leg       Leg             6
Right Ear            Ear             3
Right Eye            Eye             4
Right Eyebrow        Eyebrow         5
Right Foot           Leg             6
Right Hand           Arm             7
Right Lower Arm      Arm             7
Right Lower Leg      Leg             6
Right Upper Arm      Arm             7
Right Upper Leg      Leg             6
Mouth                Mouth           8
Neck                 Neck            9
Nose                 Nose            10
Torso                Torso           11
Non-person objects   Background      0
Background           Background      0
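One detail of the augmentation step is worth spelling out: a geometric transform such as horizontal flipping must be applied to the image and its label map together, or the per-pixel labels no longer line up, while photometric jitter (contrast, saturation, brightness, hue) should touch only the image. Our pipeline uses torchvision's transforms for this; the function below is a dependency-free illustrative sketch of the joint-flip step, with the function name and toy data ours rather than from the released code.

```python
import random

def random_hflip(image, mask, p=0.5, rng=random):
    """Flip image and label map together with probability p.

    image, mask: lists of rows (H x W), e.g. pixel values and class labels.
    Both are flipped (or neither), keeping labels aligned with pixels.
    """
    if rng.random() < p:
        image = [row[::-1] for row in image]
        mask = [row[::-1] for row in mask]
    return image, mask

# Forcing the flip (p=1.0) shows both arrays reversed in lockstep:
img, msk = random_hflip([[1, 2, 3]], [[0, 0, 1]], p=1.0)
print(img, msk)  # [[3, 2, 1]] [[1, 0, 0]]
```

Photometric jitter, by contrast, can be applied to the image alone after this step, since it does not move any pixels.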
Table 2: Bayesian SegNet Architecture
Layer Kernel Size
Example segmentation results for movie frames (not from the Pascal-Parts dataset) are shown in Figure 1. We tested the trained model on the subset of the Pascal-Parts test set containing "person" objects, and evaluated the results using the Dice score, 2TP / (2TP + FP + FN), for each class. In this measure, a true positive (TP) is a correctly labelled pixel of that class, a true negative (TN) is a correctly labelled pixel not belonging to that class, and false positive (FP) and false negative (FN) are the two possible mislabellings. The Dice scores for each segmentation class are shown in Table 3.
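The per-class Dice score defined above can be computed directly from a pair of flattened label maps. A minimal sketch (the function name and toy data are ours, not from the released code):

```python
def dice_score(pred, target, cls):
    """Dice score for one class: 2*TP / (2*TP + FP + FN).

    pred, target: flat sequences of integer class labels per pixel.
    """
    tp = sum(1 for p, t in zip(pred, target) if p == cls and t == cls)
    fp = sum(1 for p, t in zip(pred, target) if p == cls and t != cls)
    fn = sum(1 for p, t in zip(pred, target) if p != cls and t == cls)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 1.0

# Toy 6-pixel example for class 1: TP=2, FP=1, FN=1
pred   = [1, 1, 0, 1, 0, 2]
target = [1, 0, 0, 1, 1, 2]
print(dice_score(pred, target, 1))  # 2*2 / (2*2 + 1 + 1) = 0.666...
```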
Figure 1: Example visualizations of human body part segmentations for movie frames.

Table 3: Pascal-Parts test set performance

Part Name    Dice
Hair         0.58
Head         0.60
Ear          0.54
Eye          0.62
Eyebrow      0.60
Leg          0.38
Arm          0.50
Mouth        0.62
Neck         0.52
Nose         0.57
Torso        0.54
Background   0.95
The overall performance of the segmentation network is adequate for our purposes, given the sheer number of video frames in a typical movie. Over many frames, the average number of fixations to each body part should be a robust estimate. Segmentation quality can be reduced in the presence of occlusion (see Figure 1c) and when faces are presented in a frontal view (see Figure 1d). We make this tool available in the hope that it will be helpful to other researchers, and without explicit performance guarantees in any setting.
References

[1] Mark Everingham, Luc Van Gool, Christopher K. I. Williams, John Winn, and Andrew Zisserman. The PASCAL visual object classes (VOC) challenge. International Journal of Computer Vision, 88(2):303–338, 2010.
[2] Xianjie Chen, Roozbeh Mottaghi, Xiaobai Liu, Sanja Fidler, Raquel Urtasun, and Alan Yuille. Detect what you can: Detecting and representing objects using holistic models and body parts. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1971–1978, 2014.
[3] Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, et al. PyTorch: An imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems, pages 8024–8035, 2019.
[4] Alex Kendall, Vijay Badrinarayanan, and Roberto Cipolla. Bayesian SegNet: Model uncertainty in deep convolutional encoder-decoder architectures for scene understanding. arXiv preprint arXiv:1511.02680, 2015.
[5] Yarin Gal, Jiri Hron, and Alex Kendall. Concrete dropout. In Advances in Neural Information Processing Systems, 2017.