Knowledge-Guided Deep Fractal Neural Networks for Human Pose Estimation
Guanghan Ning, Student Member, IEEE, Zhi Zhang, Student Member, IEEE, and Zhihai He, Fellow, IEEE
Abstract—Human pose estimation using deep neural networks aims to map input images with large variations into multiple body keypoints, which must satisfy a set of geometric constraints and inter-dependencies imposed by the human body model. This is a very challenging nonlinear manifold learning process in a very high-dimensional feature space. We believe that the deep neural network, which is inherently an algebraic computation system, is not the most efficient way to capture highly sophisticated human knowledge, for example the highly coupled geometric characteristics and interdependence between keypoints in human poses. In this work, we propose to explore how external knowledge can be effectively represented and injected into deep neural networks to guide their training process using learned projections that impose a proper prior. Specifically, we use the stacked hourglass design and the inception-resnet module to construct a fractal network that regresses human pose images into heatmaps with no explicit graphical modeling. We encode external knowledge with visual features which are able to characterize the constraints of human body models and evaluate the fitness of intermediate network outputs. We then inject these external features into the neural network using a projection matrix learned with an auxiliary cost function. The effectiveness of the proposed inception-resnet module and the benefit of guided learning with knowledge projection are evaluated on two widely used human pose estimation benchmarks. Our approach achieves state-of-the-art performance on both datasets.
Index Terms—Human Pose Estimation, Fractal Networks, Knowledge-Guided Learning.
I. INTRODUCTION

The task of human pose estimation is to determine the precise pixel locations of body keypoints from a single input image [1]–[7]. Closely related tasks include 3D human pose estimation [8] and human pose estimation in videos [9], [10]. Human pose estimation is very important for many high-level computer vision tasks, including action and activity recognition [11]–[13], semantic content retrieval [14], human-computer interaction, motion capture [15], and animation. Estimating human poses from still images is a challenging task. An effective human pose estimation system must be able to handle large pose variations, changes in clothing and lighting conditions, severe body deformations, and heavy body occlusions [16]–[18]. A key question in addressing these problems is how to extract strong low- and mid-level appearance features that capture discriminative as well as relevant contextual information, and how to model complex part relationships allowing for effective yet efficient pose inference. Traditional methods for pose estimation are mostly based on Pictorial Structure (PS) models [19]–[24], which model the spatial relations of rigid body parts using a tree model. A major drawback of such models is the need to hand-design the structure of the model
Fig. 1. Knowledge projection for guided learning. We encode external knowledge as visual features which characterize the constraints of human body models and then inject these external features into the neural network using a projection matrix learned with an auxiliary cost function, which is removed during testing, therefore not increasing network complexity.

in order to capture important problem-specific dependencies amongst the different output variables and at the same time allow for tractable inference.

With Convolutional Neural Networks (ConvNets) and many assistive methods such as batch normalization [25], resnet [26], and the inception design [27], [28], human pose estimation has recently achieved significant progress. Even though deep neural networks are capable of fitting large training data through extensive training, the network often needs to be constructed deeper and wider to gain enough representation power [29]. As the network becomes more complex, the learning and training processes become more sophisticated and challenging [30], especially for applications with complicated loss functions.

Human pose estimation using deep neural networks requires us to map input images with large variations into multiple body keypoints which must satisfy a set of geometric constraints and interdependencies imposed by the human body model. This is a very challenging nonlinear manifold learning process in a very high-dimensional feature space. We believe that the deep neural network, which is inherently an algebraic computation system, is not the most efficient way to capture highly sophisticated human knowledge, for example the highly coupled geometric characteristics and interdependence between keypoints in human poses.

In this work, we propose to explore how external knowledge can be effectively represented and injected into deep neural networks to guide their training process using learned projections, for more accurate and robust human pose estimation.
Specifically, as illustrated in Fig. 3, we use the inception-resnet module and the stacked hourglass structure to construct a fractal network to regress human pose images into heatmaps with no explicit graphical modeling. We encode external knowledge with visual features which characterize the constraints of human body models and evaluate the fitness of intermediate network outputs. We then inject these external features into the neural network using a projection matrix learned with an auxiliary cost function. The guidance from the external knowledge is only used during the training process and is turned off during network inference for human pose estimation. The benefit of external knowledge is to guide the training of the neural network: its effect is implicitly imposed on the tuning of the parameters, instead of on an explicit feature representation of the network. The injected features for pairs of limbs impose a strong prior during training, preventing a human part keypoint from connecting to noise, e.g., a keypoint from another person in the background who is not cropped out for the target person.

The major contributions of this work are summarized as follows: (1) We develop a new framework to represent and project human knowledge to guide the training of deep neural networks for human pose estimation. This external knowledge projection framework is generic and can be extended to other learning and training applications and deep neural network designs. (2) We propose an efficient network structure, called fractal networks, for human pose estimation to capture the multi-scale interdependence between body joints in the pose model. This fractal network uses an inception-resnet module as the building block.

The rest of the paper is organized as follows. In Section II, we provide a brief review of recent work on human pose estimation. Section III introduces the concept of knowledge-guided learning, the structure of the fractal network, and the design of the inception-resnet module.
Section IV summarizes the training and testing procedures. Section V presents our experimental results. Section VI concludes our paper.

II. RELATED WORK
A. Structured Prediction and Graphical Models
Prior to the advent of neural networks, most previous work was based on pictorial structures [31], which model the human body as a collection of rigid templates and a set of pairwise potentials taking the form of a tree structure, thus allowing for efficient and exact inference at test time. Higher-level knowledge of the human body is exploited by modeling humans with body parts that are connected via a skeleton structure. Pictorial structure models [31], [32] model the spatial relations of rigid body parts using a tree model. A pre-defined kinematic body model is often used to assume that each body part is independent of all the others except for the ones it is attached to. A major drawback of such models is the need to hand-design the structure of the model in order to capture important problem-specific dependencies amongst the different output variables and at the same time allow for tractable inference. Recent work includes sophisticated extensions like mixture, hierarchical, multimodal and strong appearance models [19], [20], [22], [33], [34], non-tree models [23], [24], as well as cascaded/sequential prediction models like pose machines [35]. While in [31] each limb is represented by a single template that is parameterized by location, orientation, shape parameters, and an appearance model, Yang and Ramanan [33] propose mixtures of part templates, where each body part is represented by a set of deformable part templates. Although this approach performs well in comparison to classical pictorial structure models for human pose estimation, it has some limitations. For instance, the scanning-window templates trained with linear SVMs and HOG features [36] are very sensitive to noise [37]. Hierarchical models [21], [22] represent the relationships between parts at different scales and sizes in a hierarchical tree structure. The underlying assumption of these models is that larger parts (corresponding to full limbs instead of joints) often have discriminative image structure that is easier to detect and consequently helps reason about the location of smaller, harder-to-detect parts. On the other hand, there are non-tree models [23], [24] that incorporate interactions introducing loops, augmenting the tree structure with additional edges that capture symmetry, occlusion and long-range relationships. These methods usually have to rely on approximate inference during both learning and test time.
B. Deep Neural Networks for Human Pose Regression
ConvNets have been shown to produce remarkable performance on a variety of difficult computer vision tasks, including detection [38], [39], recognition [26], [40], and semantic segmentation [41]. A key feature of these approaches is that they integrate non-linear hierarchical feature extraction with the classification or regression task at hand, while also being able to capitalize on the very large data sets that are now readily available.

Since the work of DeepPose by Toshev et al. [16], research on human pose estimation has shifted from traditional approaches to deep neural networks (DNNs) due to their superior performance. In the context of human pose estimation, it is natural to formulate the problem as a regression one in which CNN features are regressed in order to provide joint predictions of the body parts [10], [16], [42]. For the case of non-visible parts, learning the complex mapping from occluded part appearances to part locations is hard, and the network has to rely on contextual information provided by other visible parts to infer the occluded part locations. DeepPose uses a deep neural network to directly regress the coordinates of body joints. Tompson et al. [17] argued that it is more efficient to use a DNN to regress heatmap images at multiple scales. While body models are not a necessary component for effective part localization, constraints between parts allow us to assemble independent detections into a body configuration. Detection-based methods rely on powerful CNN-based part detectors, which are then combined using a graphical model [17], [43] or refined using regression [44], [45]. Regression-based methods try to learn a mapping from image and CNN features to part locations. [43] achieved promising results by combining CNN-based body part detectors with a body model [33].
Fig. 2. Overview of fractal network. The network is fractal in that it reflects the concurrence of inception and residual design at both the highest and lowest (inception-resnet module) levels of abstraction. At the top level, input images are down-sampled to a fixed working resolution; subsequently, the inputs and outputs of all modules are of that size, including the output heatmaps. The numbers within brackets in each module denote the number of input and output channels, respectively.

Human pose estimation methods using deep neural networks have proven their significant advantages over traditional approaches. However, deeper and wider networks are often required to improve the feature representation power, which in turn leads to increased difficulty in training the neural networks. Recently, residual learning [26] has been used to significantly improve the performance of human pose estimation [18], [46]. It was used for part detection in the system of [46]. The stacked hourglass network of [18] elegantly extends fully convolutional networks [47] and deconvolution nets [48] with residual learning.

Intermediate supervision [49], recursive prediction [50], and the inception design [27], [28] are among other successful techniques that have been applied by recent methods for human pose estimation. Recently, researchers have recognized that successive predictions can boost the performance of pose estimation, where parts are sequentially refined [16], [35], [51], [52]. In these models, an initial prediction is made of all the parts; in subsequent steps, all part predictions are refined based on the image and earlier part predictions. Tompson et al. [44] use a cascade of networks for refined predictions to achieve significantly improved precision in joint localization. Carreira et al. [52] introduce a so-called
Iterative Error Feedback scheme, where a set of predictions is included in the input, and each pass through the network further refines these predictions. Their method requires multi-stage training, and the weights are shared between iterations. Recently, adding supervision to intermediate layers of deep networks has also been explored to assist the training process [27], [53]. The methods in [18], [46], [50], [51] use intermediate supervision to add auxiliary supervision branches in the network to assist the training process for human pose estimation. These approaches all employ the inception design by concatenating heatmaps from different stages or abstraction levels as the input for the next layers.

One direction for further improvement of human pose estimation is to design convolutional networks that can produce robust visual features. Multi-scale processing by repetitive down-sampling and up-sampling has been introduced in Stacked Hourglass Networks [18]. Another approach to improve human pose estimation performance is to use explicit part-based models [23], [24], [33] or to implicitly encode the configuration model using its contexts [54]. These methods involve additional sub-networks to detect parts, which increases the overall complexity. In this work, we leverage these ideas and approaches. We propose a fractal network structure using inception-resnet modules as building blocks to explore the multi-scale interdependence nature of human pose configurations and to capture these characteristics across different scales and resolutions. The network is fractal in that it reflects the co-occurrence of inception and residual design at both the highest and lowest levels of abstraction.
C. Transfer Learning and Guided Training
Nevertheless, training such deep networks has proven to be challenging [55]. Significant effort has been devoted to alleviating this problem. For instance, there has been another line of work in which a student network is trained from scratch to mimic the behavior of a much larger teacher network. Starting from Bucila et al.'s work [56] and Hinton et al.'s more general Knowledge Distillation (KD) [57] approach, knowledge transfer in the learning process has gained a lot of research interest. In this paper, we consider a unique setting of the problem. Instead of transferring knowledge from teacher networks into a student network, we propose an external knowledge representation and projection framework to guide the training process of our deep neural network for human pose estimation. Specifically, we inject hand-designed features that are inferred from ground truth as external knowledge to aid the training of a highly complex network with a deep structure and multiple loss functions. Inspired by [58], which proposed a locality principle to learn task-specific feature mappings for shape regression, we project the external knowledge with a learned feature mapping. The procedure involves domain adaptation and model training simultaneously. Since external
Fig. 3. Framework of our proposed Guided Network (GNet). The projected knowledge affects the gradients propagated back to the convolutional layers, but the projection layers are not part of the network during deployment. By enforcing constraints with external knowledge injection, high-level information about long-range dependencies between image and multi-part cues that is hard to capture with implicit learning can be better learned under the guidance of mid-level knowledge projection.

knowledge is inferred from ground truth, it is inherently more reliable and effective than the outputs from a teacher network.

III. PROPOSED METHOD
A. Network Structure and Design
Human pose estimation methods using hand-crafted features or graphical structure models based on human knowledge lack flexibility in learning and the potential to achieve great representation power. On the other hand, purely data-driven neural networks may not be able to capture the sophisticated knowledge involved in human pose estimation. In this work, we propose to represent and inject external human knowledge to guide the learning of deep neural networks (DNNs), as illustrated in Fig. 1. Our major idea is that, by enforcing constraints and guidance with external knowledge injection, high-level information about long-range dependencies between image and multi-part cues, which is hard to capture with implicit learning, can be better learned under the guidance of mid-level knowledge projection. As shown in Fig. 3, the projected knowledge affects the gradients propagated back to the convolutional layers during training, but the projection layers are not part of the network during testing.

We borrow ideas from inception-residual networks [59] and propose a basic inception-resnet module as a replacement for convolutional layers, for more robust feature representation. The hourglass network was first introduced in [18], where features are processed across all scales by repetitive down-sampling and up-sampling and then consolidated to best capture the various spatial relationships associated with the body. We introduce a modified version of the hourglass network with the proposed inception-resnet module. As shown in Figure 2, we use the proposed inception-resnet modules and improved hourglass sub-networks to construct a fractal network that regresses human pose images into heatmaps with no explicit graphical modeling. The network is fractal in that it has the same network configuration at all levels of analysis and abstraction. This fractal network is designed to capture the multi-scale interdependence nature of human pose configurations and to represent these characteristics across different scales and resolutions.

In the inception design, we perform channel-wise concatenation of two tensors from different sources. This enforces the information represented by the features stored in these tensors to be complementary to each other; it encourages and directs the two sources to work on different concepts to produce a more robust joint representation [26]. In the resnet design, we perform pixel-wise addition of two tensors with the same number of channels. From our experiments, we find that this network design allows us to train the network more effectively, since it enforces two separate tensors to be simultaneously accurate in order to render the expected outputs.
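These two fusion operations can be contrasted in a minimal pure-Python sketch, with a feature tensor represented as a list of channels, each a 2-D list of values (the representation and function names are illustrative only, not our implementation):

```python
def concat_channels(a, b):
    """Inception-style fusion: channel-wise concatenation.

    Stacks the channel lists, so the output has len(a) + len(b)
    channels; the spatial resolution is unchanged."""
    return a + b

def pixelwise_add(a, b):
    """Resnet-style fusion: pixel-wise addition.

    Requires both tensors to have the same number of channels
    and the same spatial resolution."""
    return [[[x + y for x, y in zip(row_a, row_b)]
             for row_a, row_b in zip(ch_a, ch_b)]
            for ch_a, ch_b in zip(a, b)]
```

Concatenation lets the two branches carry complementary information, while addition forces both branches to be simultaneously accurate at every pixel, which is why the module combines the two.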
B. Fractal Network with Inception-resnet Modules
Our motivation in the fractal network design is that we need the network to focus on various scales across human parts, and at each scale the network should also have an overall understanding of its receptive field. At higher levels, the network captures dependencies among various human parts. At lower levels, we use the same fractal design to capture regional dependencies. It is essential to capture local dependencies in addition to local appearances, because at a certain high-level scale the receptive field may involve a human part as well as noise from other parts, and these adjacent parts may be from the same or other persons. Therefore, local dependencies are helpful in providing more reliable features to higher-level networks.

The construction of the inception-resnet module is shown in Figure 4. Based on the hourglass design proposed in [18], shown in Figure 5, an improved version of the hourglass network is developed in this work as a mid-level sub-network which also uses inception-resnet modules as the basic units, as illustrated in Figure 6. To combine the advantages of both the inception and resnet designs, we introduce the inception-resnet module as the basic building block to analyze local fields, while using an improved hourglass network to capture the global information of different parts.

At the bottom level, we propose to use the inception-resnet module as the basic structural unit of the network. It consists of convolutional layers, batch norm layers and relu units, with channel-wise concatenations and pixel-wise additions. Convolution layers are padded such that the resolution of the output is the same as that of the input. Although the concatenation of two branches maintains different levels of information, the concatenated features across different channels need to be transformed and normalized by the subsequent convolutional layers.
In the proposed inception-resnet module, the concatenation layer is followed by another convolutional layer with 1x1 kernels. The benefit of this module is that the input and output have the same resolution while the depth of channels can be flexible.

Fig. 4. Basic module: Inception-resnet. Convolution layers are padded such that the resolution of the output is the same as that of the input. The benefit of this module is that the input and output are of uniform resolution while the depth of channels can be changed. The function of this basic module is to interpret the input information from one form to another, extracting features for another abstraction level with little loss of information.

At the sub-network level, we implement the recursive hourglass for 4 levels, as shown in Figure 5. In other words, it processes the image at four scales. The hourglass network is nested in itself, and the first level of hourglass in our network is an inception-resnet module. As illustrated in Figure 6, we also borrow the idea of the hourglass design by down-sampling and then up-sampling the data while using the inception-resnet module as the proposed common building block. Pixel-wise addition fuses the information from the two branches while keeping the input and output resolution the same.

Fig. 5. An illustration of the hourglass design proposed in [18]. Pixel-wise addition fuses the information from two branches while keeping the input and output resolution uniform. The illustration gives an example of a 4-level hourglass.

At the top fractal network level, input images are down-sampled to a fixed working resolution. Subsequently, the inputs and outputs of all modules are of that size, including the output heatmaps. The network captures and consolidates information across all scales of the image.

C. External Knowledge Representation
The fractal network is used to boost the data representation power of the deep neural network for human pose estimation. As the network grows deeper and more complicated, the training process requires careful attention. Furthermore, we recognize that the deep neural network is inherently an algebraic computing system, which might not be the most efficient way to capture the highly sophisticated human knowledge involved in pose estimation, for example the highly coupled geometric constraints and interdependence among body joints. To address these two issues, we propose in this work to encode and inject external knowledge into the fractal network to guide its training using learned projections, enforcing a prior during the training process.

In this work, we propose to inject the geometric representation of knowledge into the heatmap layer of the network. Since the heatmaps to be predicted are correlated with each other, as they largely share parameters in former layers, a constraint on one heatmap influences the parameters of these layers and therefore has an impact on the training of the other heatmaps. We observe that the intermediate layers in our network carry low- and mid-level visual features; higher-level semantic features are hard to locate and explicitly interpret. The predicted heatmaps are easier to enforce the external knowledge and constraints upon. During the training process, the external knowledge and its visual representations are projected into the background and keypoint heatmaps using a projection matrix. We find that this type of knowledge-guided learning inherently enforces long-range dependencies and configurations among human joints, while leaving the flexibility of representation to the depth of the network and the quality and quantity of training data. In the following, we explain the proposed method in more detail.

During the training process, the external knowledge representation module illustrated in Figure
1 has access to the original training sample image and its ground-truth joint locations.

Specifically, during the feature mapping, denoted as $\Phi$, we perform a Hough transform on each line traversing two separate joints, denoted as $(x_i, y_i)$ and $(x_j, y_j)$. In Hough space, each line is represented by a coordinate $(\theta, \rho)$:

$$\theta = \arctan\left(\frac{x_i - x_j}{y_j - y_i}\right), \qquad \rho = x_j \cos(\theta) + y_j \sin(\theta) \qquad (1)$$

In order to represent the information in a less crisp manner, we convert the coordinates into a normalized vector representation. To incorporate the inherent learning of geometric features such as angles and distances, we also inject the joint locations alongside each line. Based on the visibility of each joint, the line traversing it is encoded with the number of visible joints.

In addition to encoding geometric features, we encode image descriptors such as Histograms of Oriented Gradients (HOG) [36] around each pair of adjacent joints in order to capture visual features that compensate for spatial dependencies. While we preserve the flexibility of deep convolutional features that automatically learn visual semantics, we use hand-crafted features to guide the learning by enforcing a strong prior during the training of the neural network. We noticed that human joints may connect to those of an adjacent person, even when the ground-truth joints are not self-occluded or object-occluded. We believe HOG features are helpful in observing edges and therefore in distinguishing real and false limbs. The injected features for pairs of limbs impose a strong prior during training, preventing a human part keypoint from connecting to noise, e.g., a keypoint from another person in the background who is not cropped out for the target person, which is helpful in learning body part interdependencies. As illustrated in Figure 7, the features are concatenated and normalized for the external knowledge representation. For self-occluded and object-occluded joints, we mask the corresponding features with zeros.

Fig. 6. Improved hourglass sub-network. While using the inception-resnet module as the proposed common building block, we borrow the idea of the hourglass design by down-sampling and then up-sampling the dataflow in one branch while maintaining the resolution of the other branch. The lowest level of the recursive hourglass in our network is an inception-resnet module.

Fig. 7. We encode image descriptors around each pair of adjacent joints in order to capture visual features that compensate for spatial dependencies. The features are concatenated and normalized for the external knowledge representation.

Specifically, we follow the traditional HOG feature extraction scheme, applying the filters $D_x = [-1 \; 0 \; 1]$ and $D_y = [1 \; 0 \; -1]^T$ horizontally and vertically to generate gradient maps $I_x$ and $I_y$. Instead of scanning a window over blocks and cells of the image, as is done traditionally, we locate limbs based on meta-data from the training set and extract histograms of gradients for those regions.
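A compact sketch of these hand-crafted features in pure Python (the helper names, the per-pixel neighborhood access, and the use of `atan2` are illustrative assumptions, not the paper's implementation):

```python
import math

def hough_line(xi, yi, xj, yj):
    """Line through joints (xi, yi) and (xj, yj) as Hough coordinates (theta, rho)."""
    theta = math.atan2(xi - xj, yj - yi)      # arctan((xi - xj) / (yj - yi))
    rho = xj * math.cos(theta) + yj * math.sin(theta)
    return theta, rho

def gradient_at(img, x, y):
    """Gradient magnitude and orientation at (x, y) from the
    D_x = [-1 0 1] and D_y = [1 0 -1]^T filter responses."""
    ix = img[y][x + 1] - img[y][x - 1]        # I_x
    iy = img[y - 1][x] - img[y + 1][x]        # I_y
    return math.hypot(ix, iy), math.atan2(iy, ix)

def block_normalize(v, e=1e-6):
    """L2 block normalization of a histogram v with a small constant e."""
    norm = math.sqrt(sum(x * x for x in v) + e * e)
    return [x / norm for x in v]
```

Occluded joints would have their feature slots masked with zeros before concatenation, as described above.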
The magnitude and orientation of the gradient are respectively computed by:

$$|G| = \sqrt{I_x^2 + I_y^2} \qquad (2)$$

and

$$\varphi = \arctan\frac{I_y}{I_x} \qquad (3)$$

We use orientation bins for the pooling, followed by block normalization (L2-norm) to mitigate the effect of unbalanced region areas:

$$f = \frac{v}{\sqrt{\|v\|_2^2 + e^2}} \qquad (4)$$

where $e$ is a very small number.

D. Knowledge Projection into the Deep Neural Network
To decode the abstract external knowledge in a higher-dimensional space, we afford 2 fully-connected (FC) layers and 3 convolutional layers for the geometric features and edge features, between the projected representation and the injected knowledge, to learn the linear projection $W$, which is removed during testing, as it is undesirable to keep redundant layers.

We inject external features as knowledge $K$ via a global feature mapping function $\Phi$ and learn a global linear projection $W$ by minimizing the loss from the knowledge projection layer:

$$L_{KP} = \|K - W \times H_J\|^2 + \beta \times \|W\|^2 \qquad (5)$$

where the first term is the regression target, the second term is an L2 regularization on $W$, and $\beta$ controls the regularization strength. Regularization is necessary because the dimensionality of the features is very high. Since the objective function is quadratic with respect to $W$, we can always reach its global optimum [58].

Specifically, we enforce two loss functions, one for the injected geometric features and one for the limb-wise edge features. (1) The ground-truth heatmap is convolved with 1x1 kernels, outputting 8 channels of maps. It is padded such that the resolution does not change. A fully connected layer with a 224-dimensional geometric feature output is added after the convolutional layer. We add an L2 loss (weighted by 0.05) between these geometric features and the features inferred from the ground truth. (2) We branch out of the 3rd inception-resnet module at the early stage and feed its output to a series of convolutional layers with 1x1 kernels. The number of output channels is scaled down twice by a factor of 1/2 until it reaches 32 channels, followed by a fully connected layer. We add an L2 loss (weighted by 0.05) between the injected edge features and the inferred edge features.

We denote the pixel location of the $j$-th anatomical landmark (which we refer to as a human joint) by $Y_j \in \mathcal{P} \subset \mathbb{R}^2$, where $\mathcal{P}$ is the set of all $(u, v)$ pixel locations in the image coordinate system.
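Since the objective in Eq. (5) is quadratic in $W$, it has a closed-form minimizer; the scalar special case can be sketched as follows (pure Python; reducing the matrix $W$ to a single scalar w is an illustrative simplification, not the paper's implementation):

```python
def solve_projection(k, h, beta=0.1):
    """Minimize sum_i (k_i - w * h_i)^2 + beta * w^2 over a scalar w.

    Setting the derivative to zero gives w = <k, h> / (<h, h> + beta),
    the ridge-regression solution; beta > 0 keeps the problem
    well-conditioned when the features are high-dimensional."""
    num = sum(ki * hi for ki, hi in zip(k, h))
    den = sum(hi * hi for hi in h) + beta
    return num / den
```

With beta = 0 and k exactly proportional to h, the solver recovers the true scaling; a positive beta shrinks w toward zero, trading fit for stability.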
Our goal is to predict the image locations $Y = (Y_1, \ldots, Y_J)$ for all $J$ joints. The output heatmaps, denoted by $H = (H_1, \ldots, H_J)$, are predicted beliefs for assigning a pixel location to each joint, $Y_j = p, \forall p \in \mathcal{P}$, producing belief scores $S_j$ for all pixels in the heatmap of joint $j$:

$$H_j(p) \leftarrow S_j(Y_j = p) \qquad (6)$$

In our experiments, we regress RGB-channel images into a set of heatmaps, of which all but one correspond to human joints while the remaining one is the background. The heatmaps are then suppressed into joint locations $Y$ with our proposed 3D-NMS algorithm, specially designed for human pose estimation. During training, we provide ground-truth heatmaps for each joint by creating Gaussian peaks at the ground-truth locations. The cost function $L_f$ we aim to minimize for the fractal network is given by:

$$L_f = \sum_{j \in J} \|H_j(p) - H_j^*(p)\|^2 \qquad (7)$$

The overall loss for training is a weighted combination of the heatmap cost and the projection matrix fitness provided by knowledge-guided learning, with a control parameter for how much guidance should be imposed. The overall network is then trained to minimize the following joint loss function:

$$W_f^* \leftarrow \arg\min_{W_f} \left(\lambda \times L_{KP} + (1 - \lambda) \times L_f\right) \qquad (8)$$

where $L_{KP}$ and $L_f$ are the losses from the knowledge projection layer and the fractal network, $\lambda$ is a weight parameter that decays during training, and $W_f^*$ denotes the trained parameters of the fractal network.

The output of the knowledge projection layer guides the training of the fractal network by generating a strong and explicit gradient applied along the backward path to the injection layer in the following form:

$$\Delta W_{f,i} = -\lambda \cdot \frac{\partial L_{KP}}{\partial W_{f,i}} \qquad (9)$$

where $W_{f,i}$ is the weight matrix of the injection layer in the fractal network. Note that this network update only occurs during training. During testing, the knowledge representation and projection modules are removed.

E. Cross-Heatmap Non-Maximum Suppression
In this work, we introduce a novel pose non-maximum suppression (NMS) algorithm specially designed for human pose estimation. Our experiments in Section V-D show that employing pose-NMS consistently renders better predictions for all models across iterations on both the MPII [60] and LSP [61] datasets. Instead of finding the maximum value at pixel level to predict the joint location as in [18], [50], [51], [54], we detect blobs with high responses in each heatmap. Basically, we gather blobs from all heatmaps for suppression. We first find the blob with the maximum response, then suppress other blobs from the same heatmap, as well as blobs from other heatmaps that are very close to this blob in the image coordinate system. We repeat this procedure until all blobs are removed. The suppression takes place in the image coordinate system and channel-wise, i.e., over (u, v, c), and is therefore called cross-heatmap NMS.

IV. SUMMARY OF TRAINING AND TESTING PROCEDURES
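The greedy cross-heatmap suppression described in Section III-E can be sketched as follows. Blob detection itself (e.g., local-maximum extraction from the heatmaps) is assumed to happen upstream; the blob tuple format and the suppression radius are illustrative assumptions, not values from the text.

```python
def cross_heatmap_nms(blobs, radius=3.0):
    """Greedy cross-heatmap suppression over blob candidates.

    blobs: list of (score, u, v, c) tuples, where c is the heatmap channel.
    Repeatedly picks the highest-response blob, then suppresses all other
    blobs from the same heatmap and blobs from other heatmaps within
    `radius` in image coordinates. Returns one (u, v) per picked channel.
    """
    remaining = sorted(blobs, key=lambda b: -b[0])
    picked = {}
    while remaining:
        s, u, v, c = remaining.pop(0)
        picked[c] = (u, v)
        kept = []
        for b in remaining:
            same_channel = b[3] == c
            close = (b[1] - u) ** 2 + (b[2] - v) ** 2 < radius ** 2
            if not (same_channel or close):
                kept.append(b)
        remaining = kept
    return picked

# Two channels; channel 1 has a spurious blob near channel 0's strong peak.
blobs = [(0.9, 10, 10, 0), (0.4, 11, 10, 1), (0.7, 30, 5, 1), (0.3, 40, 8, 0)]
joints = cross_heatmap_nms(blobs)
```

In this toy case the weak channel-1 blob overlapping channel 0's peak is suppressed, and channel 1 falls back to its remaining blob.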
We summarize our training and testing procedures in Algorithms 1 and 2, respectively. There are around 250 convolutional layers in the original hourglass network, while the proposed network with inception-resnet modules consists of over 300 convolutional layers. Training the proposed network incurs additional cost from 1 external feature extraction module, 2 fully-connected layers, 3 convolutional layers, and 2 additional loss layers. In our implementation, the hourglass network takes an average of 47ms to feed forward on a single Pascal TITAN X GPU. In comparison, the feed-forward time of the proposed network with inception-resnet modules during testing is 62ms.
Algorithm 1:
Summary of Procedures: Training Phase

input : A set of RGB images I and corresponding ground-truth joint coordinates J
output: Trained weights W*_f for the fractal network, W*_KP for the knowledge projection layers

Initialize the DNN with the fractal network and knowledge projection layers;
for k epochs do
    for each mini-batch in I do
        Compute the external knowledge representation: K ← {I_n, J_n};
        Back-propagate w.r.t. W_f, W_KP:
        {W'_f, W'_KP} ← arg min_{W_f} (λ × L_KP + (1 − λ) × L_f);
    end
end
return {W*_f, W*_KP}

Algorithm 2:
Summary of Procedures: Testing Phase

input : A set of RGB images I and a fractal network with trained weights W*_f
output: A set of predicted joint coordinates J in the same image coordinate system

Initialize the network with only the fractal network layers W*_f;
while not at the end of the image set do
    Load image I_i;
    Forward the network: J_i ← {W*_f, I_i};
end
return J

V. EXPERIMENTAL RESULTS
For comprehensive experimental analysis, we first introduce the datasets, evaluation criteria, and implementation details. We then present quantitative evaluations on benchmark datasets. Finally, diagnostic experiments, algorithm performance analysis, and discussions are provided for further analysis.

Fig. 8. Example output produced by our network. On the top-left we see the final pose estimate provided by NMS across all heatmaps. Elsewhere we show sample heatmaps: (1) the first row shows the final part regression heatmap results; (2) the second row shows the preliminary part regression results from the intermediate supervision layer. The heatmaps in the first row give finer predictions than those in the second row, especially the heatmap for the right foot, where the preliminary prediction renders belief scores for the soccer ball as well.
A. Datasets and Criteria

1) Datasets:
We evaluate the proposed method on two widely used benchmarks: MPII Human Pose [60] and the extended Leeds Sports Poses (LSP) [61]. The MPII Human Pose dataset includes about 25K images with over 40K annotated poses. The images are collected from YouTube videos covering daily human activities with highly articulated human poses. The LSP dataset with extended training data consists of 11K training images and 1K testing images from sports activities.
2) Criteria:
There are three criteria used in the experiments to evaluate the performance of the proposed human pose estimation approach: Percentage of Correct Parts (PCP) [33], [62], [63], Percentage of Detected Joints (PDJ) [16], [19], [33], and Percentage of Correct Keypoints (PCK) [33].

a) PCP: A widely-used criterion for human pose estimation is PCP, which evaluates the localization accuracy of body parts (sticks of the skeleton). It requires that the estimated part endpoints lie within half of the part length from the ground-truth part endpoints. As pointed out by Yang and Ramanan [33], some previous work requires only the average of the endpoints of a part to be correct (PCP-average), rather than both endpoints (PCP-strict). Moreover, the early PCP implementation [62] selects the best-matched output without penalizing false positives. In all our experiments, we adopt the strictest measure, i.e., PCP-strict with single output, unless otherwise specified. For more detailed descriptions of PCP, we refer the reader to [62] and [33].

b) AUC:
Though PCP is the initially preferred criterionfor evaluation, it has the drawback of penalizing shorter limbs,such as lower arms. Thus PDJ is introduced [16], [19] tomeasure the detection rate of body joints, where a joint isconsidered to be detected if the distance between the detectedjoint and the true joint is less than a fraction of the torsodiameter. The torso diameter is usually defined as the distancebetween opposing joints on the human torso, such as leftshoulder and right hip [16]. The Area Under Curve (AUC)can be used as the overall evaluation of the PDJ curve. In the following experiments, we report AUC as our PDJperformance. c) PCK:
The PCK measure is very similar to the PDJ criterion. The only difference is that the torso diameter is replaced with the maximum side length of the external rectangle of the ground-truth body joints. For full-body images with extreme poses (especially when the torso becomes very small), PCK may be more suitable for evaluating the accuracy of body-part localization.

In our experiments, we follow the official benchmark evaluation protocols. The official benchmark on the MPII dataset adopts PCKh (using a portion of the head length as reference) at 0.5, while the official benchmark on the LSP dataset adopts both [email protected] and [email protected]. The LSP benchmark provides comparisons under both Observer-Centric (OC) and Person-Centric (PC) evaluations, of which the most widely adopted protocol is PCK-PC. In addition, both benchmarks adopt AUC scores.

B. Implementation Details

1) Data Augmentation:
We crop the images with the target human centered, at roughly the same scale, and warp the image patch to a fixed network input size. Then, we randomly rotate and flip the images, and perform random re-scaling and color jittering to make the model more robust to scale and illumination changes.
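The random augmentation above amounts to sampling a rotation angle, a scale factor, and a flip flag per training patch. The sketch below draws one such parameter set; the numeric bounds (±30°, 0.75–1.25) are assumed illustrative values, not the exact settings used in our experiments.

```python
import random

def sample_augmentation(rng=None, max_rot=30.0, scale_lo=0.75, scale_hi=1.25):
    """Draw one set of augmentation parameters for a training patch.

    Returns rotation in degrees, an isotropic scale factor, and a
    horizontal-flip flag. Bounds are illustrative assumptions.
    """
    rng = rng or random.Random(0)
    return {
        "rot": rng.uniform(-max_rot, max_rot),   # random in-plane rotation
        "scale": rng.uniform(scale_lo, scale_hi),  # random re-scaling
        "flip": rng.random() < 0.5,              # horizontal flip with p=0.5
    }

p = sample_augmentation()
```

Color jittering would be sampled analogously (e.g., per-channel gain/offset) and applied after the geometric warp.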
2) Experimental Settings:
We use a modified version of
Caffe [64] that produces three kinds of outputs from the data layer: the augmented image, the corresponding transformed ground-truth heatmaps, and the injected knowledge for the augmented image. The knowledge projection is switched off during testing. The parameters are optimized by the RMSprop [65] algorithm; we halve the learning rate whenever the validation set hits a plateau, down to a preset minimum. We use 4 Pascal TITAN X GPUs to train the model on the merged dataset of MPII and extended LSP, and adopt Tompson's validation split of the MPII dataset used in [17] to monitor the training process. The same model is used for testing on both the MPII and LSP test sets.

According to [66], there is a prior towards the background that forces the network to converge to zero. It is therefore important to weight the gradient responses so that the foreground and background heatmap pixels contribute equally to the parameter update. In our training process, we weight the foreground and background by 20:1. The neural network takes the cropped image patches or ROI of the images as input. However, the cropped patches or ROI may contain limbs from other persons. In this case, our ground truth simply ignores those limbs: any region that does not belong to the keypoints of the target person is treated as background in the ground-truth heatmaps. Since the target person is always centered in the cropped image or ROI, this enforces a prior during training; limbs from other persons therefore usually receive lower responses in the predicted heatmaps.

TABLE I: Component analysis on the LSP dataset ([email protected] score). Note that numbers in bold indicate the method has employed all techniques during testing.

Method                Head  Sho.  Elb.  Wri.  Hip   Knee  Ank.  Total
Hourglass             97.0  93.0  88.8  85.6  92.2  93.0  90.9    –
Ours (no guidance)    97.9  93.2  89.1  86.4  94.5  93.8  92.9    –
Ours (with guidance)  98.2  94.4  91.8  89.3  94.7  95.0  93.5  93.9
Plain testing         97.4  92.7  88.8  86.7  92.2  93.8  92.2  92.0
+ flipping            97.7  93.3  90.4  87.5  93.2  94.2  92.8  92.7
+ scaling             98.1  93.7  91.3  88.7  94.0  94.6  93.2  93.4
+ 3D-NMS              98.2  94.4  91.8  89.3  94.7  95.0  93.5  93.9

http://human-pose.mpi-inf.mpg.de/
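The 20:1 foreground/background weighting above can be sketched as a weighted per-pixel squared-error loss. This is a minimal numpy illustration with an assumed foreground threshold, not our Caffe loss layer.

```python
import numpy as np

def weighted_heatmap_loss(pred, gt, fg_weight=20.0, bg_weight=1.0, thresh=0.01):
    """Per-pixel squared error with foreground pixels up-weighted.

    Without weighting, the dominant background pushes the network toward
    the all-zero solution; up-weighting foreground pixels (20:1 in our
    training) rebalances the two contributions. `thresh` (an assumed
    value) decides which ground-truth pixels count as foreground.
    """
    w = np.where(gt > thresh, fg_weight, bg_weight)
    return float(np.sum(w * (pred - gt) ** 2))

# A single Gaussian-free toy target: one foreground pixel, all-zero prediction.
gt = np.zeros((4, 4))
gt[1, 1] = 1.0
pred = np.zeros((4, 4))
loss = weighted_heatmap_loss(pred, gt)  # the missed peak costs 20x more
```

With equal weights the same miss would cost only 1.0, so the network could profitably predict all zeros; the 20:1 weighting removes that shortcut.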
3) Inference:
During testing, we follow the standard routine of cropping image patches using the given rough position and scale of the test person for the MPII dataset. For the LSP dataset, we use the image size as the rough scale, and the image center as the rough position of the target person, to crop the image patches. Before feeding images into the neural network, we further pre-process them with normalization and pixel-wise subtraction of the estimated mean value. All experimental results are produced from original and flipped image pyramids with 2 scales (1 and 0.75). Note that we swap the heatmaps of left and right limbs before merging the corresponding heatmaps for each joint. The merged heatmaps are transformed into joint coordinates by the proposed cross-heatmap non-maximum suppression method. The feed-forward time of the network during testing is 62ms with a single Pascal TITAN X GPU.
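The flip-and-merge step above (mirror the flipped prediction back, swap left/right channels, then average) can be sketched in numpy as follows. The channel layout and flip pairs are illustrative assumptions.

```python
import numpy as np

def flip_merge(heatmaps, flipped_heatmaps, flip_pairs):
    """Average heatmaps predicted from the original and flipped image.

    heatmaps, flipped_heatmaps: arrays of shape (J, h, w).
    flip_pairs: list of (left, right) channel index pairs to swap,
    since a horizontal flip turns left joints into right joints.
    """
    restored = flipped_heatmaps[:, :, ::-1].copy()  # undo horizontal flip
    for l, r in flip_pairs:                         # swap left/right joints
        restored[[l, r]] = restored[[r, l]]
    return 0.5 * (heatmaps + restored)

# Toy case: 2 joint channels (0 = left, 1 = right) on a 1x4 grid.
heatmaps = np.zeros((2, 1, 4))
heatmaps[0, 0, 1] = 1.0
heatmaps[1, 0, 3] = 1.0
# A perfect prediction on the flipped image: mirrored locations,
# left/right identities exchanged.
flipped = np.zeros((2, 1, 4))
flipped[1, 0, 2] = 1.0
flipped[0, 0, 0] = 1.0
merged = flip_merge(heatmaps, flipped, flip_pairs=[(0, 1)])
```

For a perfectly consistent network the merged result equals the original prediction; in practice the average suppresses view-dependent noise.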
C. Benchmark Evaluation
We use the Percentage Correct Keypoints (PCK) [33] metricfor comparisons on the LSP dataset, and the PCKh measure[60], where the error tolerance is normalized with respectto head size, for comparisons on the MPII Human Posedataset. We train our model by adding the MPII training setto the extended LSP training set with person-centric (PC)annotations, which is a standard routine [45], [46], [50], [51].
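Both PCK and PCKh reduce to the same computation with different normalizing lengths. The sketch below assumes simple (J, 2) coordinate arrays; the function name and toy values are illustrative.

```python
import numpy as np

def pck(pred, gt, ref_len, alpha=0.2):
    """Percentage of Correct Keypoints.

    pred, gt: (J, 2) arrays of joint coordinates.
    ref_len: the normalizing length -- head size for PCKh (MPII), or the
    longest side of the ground-truth bounding box for PCK (LSP).
    A joint counts as correct if its error is within alpha * ref_len.
    """
    dist = np.linalg.norm(pred - gt, axis=1)
    return float(np.mean(dist <= alpha * ref_len))

gt = np.array([[10.0, 10.0], [50.0, 50.0], [90.0, 40.0]])
pred = np.array([[11.0, 10.0], [75.0, 50.0], [90.0, 41.0]])
score = pck(pred, gt, ref_len=100.0, alpha=0.2)  # 2 of 3 joints within 20 px
```

Setting `alpha=0.5` and `ref_len` to the head length gives [email protected]; sweeping `alpha` traces out curves like those in Fig. 9.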
1) Results on the MPII Human Pose Dataset:

a) AUC: We also report the AUC score of our network on the MPII test set.

b) [email protected]: Table II reports the comparison of the PCKh performance of our method and previous state of the art at a normalized distance of 0.5. Our total [email protected] score achieves state-of-the-art performance at 91.2%. We apply all techniques described in Section V-D during testing. Note that we test at the same multiple scales (1 and 0.75) as used on the LSP dataset, which may not be ideal. When cropping the images with the given scale of the MPII dataset, the feet are cropped out in some images, which leads to a comparatively lower detection rate for ankles.
2) Results on the Leeds Sports Pose Dataset:

a) AUC: We also report the AUC score of our network on the LSP test set.
Fig. 9. Person-Centric (PC) PCK curves on the LSP test set. Ours is on top.

b) [email protected]: Table III reports the PCK at a threshold of 0.2, and Fig. 9 exhibits PCK over various thresholds. Our approach achieves state-of-the-art performance with a PCK value of 93.9%, and outperforms all existing methods on each body-part prediction.

c) PCP: Table IV reports the PCP at a threshold of 0.5.

D. Algorithm Performance Analysis and Ablation Study
Since the ground truth of the MPII dataset is not publicly available and frequent submission of MPII test results to the official server is forbidden, we perform component analysis of our proposed method on the LSP dataset. We analyze the contribution of each component in Table I.

We compare the proposed inception-resnet module with the basic resnet module employed by stacked hourglass networks [18]. Since their performance is not reported on the LSP dataset, we implement their network within our system to render fair comparisons. Under identical settings, our network with the inception-resnet module achieves superior accuracy over the one with the basic resnet module. We also compare our network under standard training with the same network under knowledge projection and guided learning; results show that knowledge-guided training achieves better performance. We then analyze the contributions of other techniques employed mainly during testing, i.e., flipping the image, testing the image at multiple scales, and using the proposed NMS algorithm for pose estimation. Testing on original and flipped images improves performance by 0.7%, testing at both the original and 0.75 scales further improves performance by another 0.7%, and cross-heatmap non-maximum suppression improves the PCK value by a further 0.5%.

It should be noted that our implementation in PyCaffe [64] may not fully reproduce the performance of the hourglass network [18] on the MPII dataset, since the latter is implemented in Torch [75]. However, our performance analysis shows that the proposed knowledge-guided training improves performance on top of an existing deep neural network, and we expect that the same gain can be achieved with other network structures.

TABLE II: Comparisons of [email protected] score on the MPII test set.

Method                             Head  Sho.  Elb.  Wri.  Hip   Knee  Ank.  Total
Ours                               98.1  96.3  92.2  87.8  90.6  87.6  82.7  91.2
Newell et al., ECCV'16 [18]        98.2  96.3  91.2  87.1  90.1  87.4  83.6  90.9
Bulat&Tzimiropoulos, ECCV'16 [54]  97.9  95.1  89.9  85.3  89.4  85.7  81.7  89.7
Wei et al., CVPR'16 [51]           97.8  95.0  88.7  84.0  88.4  82.8  79.4  88.5
Insafutdinov et al., ECCV'16 [46]  96.8  95.2  89.3  84.4  88.4  83.4  78.0  88.5
Rafi et al., BMVC'16 [67]          97.2  93.9  86.4  81.3  86.8  80.6  73.4  86.3
Gkioxari et al., ECCV'16 [68]      96.2  93.1  86.7  82.1  85.2  81.4  74.1  86.1
Lifshitz et al., ECCV'16 [69]      97.8  93.3  85.7  80.4  85.3  76.6  70.2  85.0
Pishchulin et al., CVPR'16 [45]    94.1  90.2  83.4  77.3  82.6  75.7  68.6  82.4
Hu&Ramanan, CVPR'16 [70]           95.0  91.6  83.0  76.6  81.9  74.5  69.5  82.4
Tompson et al., CVPR'15 [44]       96.1  91.9  83.9  77.8  80.9  72.3  64.8  82.0
Carreira et al., CVPR'16 [52]      95.7  91.7  81.7  72.4  82.8  73.2  66.4  81.3
Tompson et al., NIPS'14 [17]       95.8  90.3  80.5  74.3  77.6  69.7  62.8  79.6
Pishchulin et al., ICCV'13 [34]    74.3  49.0  40.8  34.1  36.5  34.4  35.2  44.1

Fig. 10. Qualitative results on the MPII test set.

VI. CONCLUSION
In this work, we have proposed to encode and inject external human knowledge into deep neural networks to guide their training process with learned projections for more effective human pose estimation. We adopt the stacked hourglass design and propose to use inception-resnet as the building block of our fractal network to regress human poses into heatmaps with no explicit graphical modeling. Utilizing a multi-resolution feature representation with guided learning, the network learns an empirical set of low- and high-level features which are typically more tolerant to variations in the training set. Knowledge-guided learning is a generic scheme that can potentially be used to aid other deep neural network training tasks. The effectiveness of the proposed inception-resnet module and the benefit of guided learning with knowledge projection are evaluated on two widely used benchmarks.

Code and models available at: http://github.com/Guanghan/GNet-pose

Fig. 11. Qualitative results on the LSP test set.

TABLE III: Comparisons of [email protected] score on the LSP test set.

Method                             Head  Sho.  Elb.  Wri.  Hip   Knee  Ank.  Total
Ours                               98.2  94.4  91.8  89.3  94.7  95.0  93.5  93.9
Bulat&Tzimiropoulos, ECCV'16 [54]  97.2  92.1  88.1  85.2  92.2  91.4  88.7  90.7
Wei et al., CVPR'16 [51]           97.8  92.5  87.0  83.9  91.5  90.8  89.9  90.5
Insafutdinov et al., ECCV'16 [46]  97.4  92.7  87.5  84.4  91.5  89.9  87.2  90.1
Pishchulin et al., CVPR'16 [45]    97.0  91.0  83.8  78.1  91.0  86.7  82.0  87.1
Lifshitz et al., ECCV'16 [69]      96.8  89.0  82.7  79.1  90.9  86.0  82.5  86.7
Belagiannis&Zisserman, FG'17 [50]  95.2  89.0  81.5  77.0  83.7  87.0  82.8  85.2
Yu et al., ECCV'16 [71]            87.2  88.2  82.4  76.3  91.4  85.8  78.7  84.3
Rafi et al., BMVC'16 [67]          95.8  86.2  79.3  75.0  86.6  83.8  79.8  83.8
Yang et al., CVPR'16 [72]          90.6  78.1  73.8  68.8  74.8  69.9  58.9  73.6
Chen&Yuille, NIPS'14 [43]          91.8  78.2  71.8  65.5  73.3  70.2  63.4  73.4
Fan et al., CVPR'15 [73]           92.4  75.2  65.3  64.0  76.7  68.3  70.4  73.0
Tompson et al., NIPS'14 [17]       90.6  79.2  67.9  63.4  69.5  71.0  64.2  72.3
Pishchulin et al., ICCV'13 [34]    87.2  56.7  46.7  38.0  61.0  57.5  52.7  57.1
Wang&Li, CVPR'13 [74]              84.7  57.1  43.7  36.7  56.7  52.4  50.8  54.6

TABLE IV: Comparisons of [email protected] score on the LSP test set.

Method                             Torso  U.Leg  L.Leg  U.Arm  Forearm  Head  Total
Ours                               98.6   95.8   93.6   90.7   84.2     96.4    –
Bulat&Tzimiropoulos, ECCV'16 [54]  97.7   92.4   89.3   86.7   79.7     95.2  88.9
Wei et al., CVPR'16 [51]           98.0   82.2   89.1   85.8   77.9     95.0  88.3
Insafutdinov et al., ECCV'16 [46]  97.0   90.6   86.9   86.1   79.5     95.4  87.8
Yu et al., ECCV'16 [71]            98.0   93.1   88.1   82.9   72.6     83.0  85.4
Pishchulin et al., CVPR'16 [45]    97.0   88.8   82.0   82.4   71.8     95.8  84.3
Lifshitz et al., ECCV'16 [69]      97.3   88.8   84.4   80.6   71.4     94.8  84.3
Belagiannis&Zisserman, FG'17 [50]  96.0   86.7   82.2   79.4   69.4     89.4  82.1
Rafi et al., BMVC'16 [67]          97.6   87.3   80.2   76.8   66.2     93.3  81.2
Yang et al., CVPR'16 [72]          95.6   78.5   71.8   72.2   61.8     83.9  74.8
Chen&Yuille, NIPS'14 [43]          96.0   77.2   72.2   69.7   58.1     85.6  73.6
Fan et al., CVPR'15 [73]           95.4   77.7   69.8   62.8   49.1     86.6  70.1
Tompson et al., NIPS'14 [17]       90.3   70.4   61.1   63.0   51.2     83.7  66.6
Pishchulin et al., ICCV'13 [34]    88.7   63.6   58.4   46.0   35.2     85.1  58.0
Wang&Li, CVPR'13 [74]              87.5   56.0   55.8   43.1   32.1     79.1  54.1

Fig. 12. Failure cases on the LSP dataset: (a) ambiguity caused by full occlusion of 2 or more adjacent body parts; (b) regression mistakes caused by the concurrence of body-part noise from other persons and full occlusion of fewer than 2 body parts.

REFERENCES

[1] L. Fu, J. Zhang, and K. Huang, "Orgm: Occlusion relational graphical model for human pose estimation," IEEE Transactions on Image Processing, vol. 26, no. 2, pp. 927–941, 2017.
[2] M. Dantone, J. Gall, C. Leistner, and L. Van Gool, "Body parts dependent joint regressors for human pose estimation in still images," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 36, no. 11, pp. 2131–2143, 2014.
[3] X. Zhang, C. Li, W. Hu, X. Tong, S. Maybank, and Y. Zhang, "Human pose estimation and tracking via parsing a tree structure based human model," IEEE Transactions on Systems, Man, and Cybernetics: Systems, vol. 44, no. 5, pp. 580–592, 2014.
[4] H. Jiang, "Human pose estimation using consistent max covering," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 33, no. 9, pp. 1911–1918, 2011.
[5] L. Zhao, X. Gao, D. Tao, and X. Li, "Learning a tracking and estimation integrated graphical model for human pose tracking," IEEE Transactions on Neural Networks and Learning Systems, vol. 26, no. 12, pp. 3176–3186, 2015.
[6] Q. Li, F. He, T. Wang, L. Zhou, and S. Xi, "Human pose estimation by exploiting spatial and temporal constraints in body-part configurations," IEEE Access, vol. 5, pp. 443–454, 2017.
[7] M. Eichner and V. Ferrari, "Human pose co-estimation and applications," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 34, no. 11, pp. 2282–2288, 2012.
[8] V. Belagiannis, S. Amin, M. Andriluka, B. Schiele, N. Navab, and S. Ilic, "3d pictorial structures revisited: Multiple human pose estimation," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 38, no. 10, pp. 1929–1942, 2016.
[9] F. Zhou and F. De la Torre, "Spatio-temporal matching for human pose estimation in video," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 38, no. 8, pp. 1492–1504, 2016.
[10] T. Pfister, J. Charles, and A. Zisserman, "Flowing convnets for human pose estimation in videos," in ICCV, 2015, pp. 1913–1921.
[11] N. Ikizler-Cinbis and S. Sclaroff, "Web-based classifiers for human action recognition," IEEE Transactions on Multimedia, vol. 14, no. 4, pp. 1031–1045, 2012.
[12] A. Marcos-Ramiro, D. Pizarro, M. Marron-Romera, and D. Gatica-Perez, "Let your body speak: Communicative cue extraction on natural interaction using rgbd data," IEEE Transactions on Multimedia, vol. 17, no. 10, pp. 1721–1732, 2015.
[13] X. Cai, W. Zhou, L. Wu, J. Luo, and H. Li, "Effective active skeleton representation for low latency human action recognition," IEEE Transactions on Multimedia, vol. 18, no. 2, pp. 141–154, 2016.
[14] R. Ren and J. Collomosse, "Visual sentences for pose retrieval over low-resolution cross-media dance collections," IEEE Transactions on Multimedia, vol. 14, no. 6, pp. 1652–1661, 2012.
[15] H. Kadu and C.-C. J. Kuo, "Automatic human mocap data classification," IEEE Transactions on Multimedia, vol. 16, no. 8, pp. 2191–2202, 2014.
[16] A. Toshev and C. Szegedy, "Deeppose: Human pose estimation via deep neural networks," in CVPR, 2014.
[17] J. J. Tompson, A. Jain, Y. LeCun, and C. Bregler, "Joint training of a convolutional network and a graphical model for human pose estimation," in NIPS, 2014.
[18] A. Newell, K. Yang, and J. Deng, "Stacked hourglass networks for human pose estimation," in ECCV, 2016.
[19] B. Sapp and B. Taskar, "Modec: Multimodal decomposable models for human pose estimation," in CVPR, 2013.
[20] L. Pishchulin, M. Andriluka, P. Gehler, and B. Schiele, "Poselet conditioned pictorial structures," in CVPR, 2013.
[21] M. Sun and S. Savarese, "Articulated part-based model for joint object detection and pose estimation," in ICCV, 2011.
[22] Y. Tian, C. L. Zitnick, and S. G. Narasimhan, "Exploring the spatial hierarchy of mixture models for human pose estimation," in ECCV, 2012.
[23] M. Dantone, J. Gall, C. Leistner, and L. Van Gool, "Human pose estimation using body parts dependent joint regressors," in CVPR, 2013.
[24] L. Karlinsky and S. Ullman, "Using linking features in learning non-parametric part models," in ECCV, 2012.
[25] S. Ioffe and C. Szegedy, "Batch normalization: Accelerating deep network training by reducing internal covariate shift," arXiv preprint arXiv:1502.03167, 2015.
[26] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in CVPR, 2016.
[27] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich, "Going deeper with convolutions," in CVPR, 2015.
[28] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna, "Rethinking the inception architecture for computer vision," in CVPR, 2016.
[29] J. Ba and R. Caruana, "Do deep nets really need to be deep?" in NIPS, 2014, pp. 2654–2662.
[30] A. Romero, N. Ballas, S. E. Kahou, A. Chassang, C. Gatta, and Y. Bengio, "Fitnets: Hints for thin deep nets," in ICLR, 2015.
[31] P. F. Felzenszwalb and D. P. Huttenlocher, "Pictorial structures for object recognition," International Journal of Computer Vision, vol. 61, no. 1, pp. 55–79, 2005.
[32] M. A. Fischler and R. A. Elschlager, "The representation and matching of pictorial structures," IEEE Transactions on Computers, vol. 100, no. 1, pp. 67–92, 1973.
[33] Y. Yang and D. Ramanan, "Articulated human detection with flexible mixtures of parts," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 35, no. 12, pp. 2878–2890, 2013.
[34] L. Pishchulin, M. Andriluka, P. Gehler, and B. Schiele, "Strong appearance and expressive spatial models for human pose estimation," in ICCV, 2013.
[35] V. Ramakrishna, D. Munoz, M. Hebert, J. A. Bagnell, and Y. Sheikh, "Pose machines: Articulated pose estimation via inference machines," in ECCV, 2014.
[36] N. Dalal and B. Triggs, "Histograms of oriented gradients for human detection," in CVPR, 2005, pp. 886–893.
[37] X. Zhu, C. Vondrick, D. Ramanan, and C. C. Fowlkes, "Do we need more training data or better models for object detection?" in BMVC, 2012.
[38] S. Ren, K. He, R. Girshick, and J. Sun, "Faster r-cnn: Towards real-time object detection with region proposal networks," in NIPS, 2015, pp. 91–99.
[39] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi, "You only look once: Unified, real-time object detection," in CVPR, 2016, pp. 779–788.
[40] K. Simonyan and A. Zisserman, "Very deep convolutional networks for large-scale image recognition," arXiv preprint arXiv:1409.1556, 2014.
[41] E. Shelhamer, J. Long, and T. Darrell, "Fully convolutional networks for semantic segmentation," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 39, no. 4, pp. 640–651, 2017.
[42] V. Belagiannis, C. Rupprecht, G. Carneiro, and N. Navab, "Robust optimization for deep regression," in ICCV, 2015, pp. 2830–2838.
[43] X. Chen and A. L. Yuille, "Articulated pose estimation by a graphical model with image dependent pairwise relations," in NIPS, 2014.
[44] J. Tompson, R. Goroshin, A. Jain, Y. LeCun, and C. Bregler, "Efficient object localization using convolutional networks," in CVPR, 2015.
[45] L. Pishchulin, E. Insafutdinov, S. Tang, B. Andres, M. Andriluka, P. V. Gehler, and B. Schiele, "Deepcut: Joint subset partition and labeling for multi person pose estimation," in CVPR, 2016.
[46] E. Insafutdinov, L. Pishchulin, B. Andres, M. Andriluka, and B. Schiele, "Deepercut: A deeper, stronger, and faster multi-person pose estimation model," in ECCV, 2016.
[47] J. Long, E. Shelhamer, and T. Darrell, "Fully convolutional networks for semantic segmentation," in CVPR, 2015.
[48] M. D. Zeiler, G. W. Taylor, and R. Fergus, "Adaptive deconvolutional networks for mid and high level feature learning," in ICCV, 2011.
[49] L. Wang, C.-Y. Lee, Z. Tu, and S. Lazebnik, "Training deeper convolutional networks with deep supervision," arXiv preprint arXiv:1505.02496, 2015.
[50] V. Belagiannis and A. Zisserman, "Recurrent human pose estimation," arXiv preprint arXiv:1605.02914, 2016.
[51] S.-E. Wei, V. Ramakrishna, T. Kanade, and Y. Sheikh, "Convolutional pose machines," in CVPR, 2016.
[52] J. Carreira, P. Agrawal, K. Fragkiadaki, and J. Malik, "Human pose estimation with iterative error feedback," in CVPR, 2016.
[53] C.-Y. Lee, S. Xie, P. W. Gallagher, Z. Zhang, and Z. Tu, "Deeply-supervised nets," in AISTATS, 2015.
[54] A. Bulat and G. Tzimiropoulos, "Human pose estimation via convolutional part heatmap regression," in ECCV, 2016.
[55] D. Erhan, P.-A. Manzagol, Y. Bengio, S. Bengio, and P. Vincent, "The difficulty of training deep architectures and the effect of unsupervised pre-training," in AISTATS, 2009, pp. 153–160.
[56] C. Bucilu, R. Caruana, and A. Niculescu-Mizil, "Model compression," in Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2006, pp. 535–541.
[57] G. Hinton, O. Vinyals, and J. Dean, "Distilling the knowledge in a neural network," arXiv preprint arXiv:1503.02531, 2015.
[58] S. Ren, X. Cao, Y. Wei, and J. Sun, "Face alignment at 3000 fps via regressing local binary features," in CVPR, 2014.
[59] C. Szegedy, S. Ioffe, V. Vanhoucke, and A. Alemi, "Inception-v4, inception-resnet and the impact of residual connections on learning," arXiv preprint arXiv:1602.07261, 2016.
[60] M. Andriluka, L. Pishchulin, P. Gehler, and B. Schiele, "2d human pose estimation: New benchmark and state of the art analysis," in CVPR, 2014.
[61] S. Johnson and M. Everingham, "Clustered pose and nonlinear appearance models for human pose estimation," in BMVC, 2010.
[62] V. Ferrari, M. Marin-Jimenez, and A. Zisserman, "Progressive search space reduction for human pose estimation," in CVPR, 2008, pp. 1–8.
[63] M. Eichner, M. Marin-Jimenez, A. Zisserman, and V. Ferrari, "2d articulated human pose estimation and retrieval in (almost) unconstrained still images," International Journal of Computer Vision, vol. 99, no. 2, pp. 190–214, 2012.
[64] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell, "Caffe: Convolutional architecture for fast feature embedding," in Proceedings of the 22nd ACM International Conference on Multimedia, 2014.
[65] T. Tieleman and G. Hinton, "Lecture 6.5-rmsprop: Divide the gradient by a running average of its recent magnitude," COURSERA: Neural Networks for Machine Learning, vol. 4, no. 2, 2012.
[66] P. H. Pinheiro and R. Collobert, "Recurrent convolutional neural networks for scene labeling," in ICML, 2014.
[67] U. Rafi, I. Kostrikov, J. Gall, and B. Leibe, "An efficient convolutional network for human pose estimation," in BMVC, 2016.
[68] G. Gkioxari, A. Toshev, and N. Jaitly, "Chained predictions using convolutional neural networks," in ECCV, 2016.
[69] I. Lifshitz, E. Fetaya, and S. Ullman, "Human pose estimation using deep consensus voting," in ECCV, 2016.
[70] P. Hu and D. Ramanan, "Bottom-up and top-down reasoning with hierarchical rectified gaussians," in CVPR, 2016.
[71] X. Yu, F. Zhou, and M. Chandraker, "Deep deformation network for object landmark localization," in ECCV, 2016.
[72] W. Yang, W. Ouyang, H. Li, and X. Wang, "End-to-end learning of deformable mixture of parts and deep convolutional neural networks for human pose estimation," in CVPR, 2016.
[73] X. Fan, K. Zheng, Y. Lin, and S. Wang, "Combining local appearance and holistic view: Dual-source deep neural networks for human pose estimation," in CVPR, 2015.
[74] F. Wang and Y. Li, "Beyond physical connections: Tree models in human pose estimation," in CVPR, 2013.
[75] R. Collobert, K. Kavukcuoglu, and C. Farabet, "Torch7: A matlab-like environment for machine learning," in BigLearn, NIPS Workshop, 2011.