Towards Interpretable and Robust Hand Detection via Pixel-wise Prediction
Dan Liu a,1, Libo Zhang b,2,∗, Tiejian Luo a,1, Lili Tao c, Yanjun Wu b
a University of Chinese Academy of Sciences, China 100049
b Institute of Software Chinese Academy of Sciences, China 100190
c University of the West of England, Bristol BS16 1QY, U.K.
Abstract
The lack of interpretability of existing CNN-based hand detection methods makes it difficult to understand the rationale behind their predictions. In this paper, we propose a novel neural network model which, for the first time, introduces interpretability into hand detection. The main improvements include: (1) detecting hands at the pixel level to explain which pixels are the basis for its decisions and to improve the transparency of the model; (2) the explainable Highlight Feature Fusion block, which highlights distinctive features among multiple layers and learns discriminative ones to gain robust performance; (3) a transparent representation, the rotation map, which learns rotation features instead of complex and non-transparent rotation and derotation layers; (4) auxiliary supervision, which accelerates the training process and saves more than 10 hours in our experiments. Experimental results on the VIVA and Oxford hand detection and tracking datasets show that our method achieves accuracy competitive with state-of-the-art methods at a higher speed.
Keywords:
Interpretability, hand detection, pixel level, explainable representation, rotation map

∗ Corresponding author
Email address: [email protected] (Libo Zhang). Dan Liu and Tiejian Luo contributed equally and should be considered co-first authors. Models and code are available at https://isrc.iscas.ac.cn/gitlab/research/pr2020-phdn. This work was supported by the National Natural Science Foundation of China, Grant No. 61807033, and the Key Research Program of Frontier Sciences, CAS, Grant No. ZDBS-LY-JSC038. Libo Zhang was supported by the Youth Innovation Promotion Association, CAS, and the Outstanding Youth Scientist Project of ISCAS.
Preprint submitted to Pattern Recognition, January 14, 2020
1. Introduction
Deep neural networks are widely adopted in many fields of study, e.g., computer vision and natural language processing, and achieve state-of-the-art results. However, as their inner workings are not transparent, the correctness and objectivity of the predicted results cannot be guaranteed, which limits their adoption in industry. In recent years, some researchers have begun to explore interpretable deep learning methods. [1] focuses on network interpretability in medical image diagnosis. [2] decomposes the output into contributions of its input features to interpret an image classification network. There is also a clear need to develop an interpretable neural network for driving monitoring, as the predicted results directly affect the safety of drivers, passengers, and pedestrians. In this paper, we present a highly interpretable neural network to detect hands in images, which is a basic task in driving monitoring.

Hand detection in natural scenes plays an important role in virtual reality, human-computer interaction, and driving monitoring [3, 4]. It is a critical and primary task for higher-level tasks such as hand tracking, gesture recognition, and human activity understanding. In particular, accurately detecting hands is a vital part of monitoring driving behavior [4, 5]. Detecting hands in images is a challenging task. Illumination conditions, occlusion, and color/shape similarity bring great difficulties to hand detection. Moreover, hands are highly deformable objects, which are hard to detect due to their variability and flexibility. Hands are not always shown in an upright position in images, so the rotation angle needs to be considered to locate the hand more accurately.

The problem of hand detection has been studied for years. Traditional methods extract features such as skin-related features [6], hand shape and background, and Histograms of Oriented Gradients (HOG) [7] to build a feature vector for each sample. Then these vectors are used to train classifiers such as SVM [8].
Figure 1: Different connection modes of multi-scale features. (a) Serial mode. (b) Cascade mode.

Although the hand-crafted features have clear meanings and are easy to understand, they are too limited to meet the accuracy requirements of hand detection in the real world. With the increasing influence of Convolutional Neural Networks (CNNs) in the field of computer vision, many CNN-based object detection methods have emerged, for example Region-Based Convolutional Networks (R-CNNs) [9] and the Single Shot MultiBox Detector (SSD) [10]. Inspired by these advances, many CNN-based methods have been proposed for hand detection. Features are extracted automatically by designed CNNs from the original images [11, 12] or from region proposals [3] and then used to locate the hands in the original images. In order to extract as many effective features as possible to detect hands more accurately, the network structure is usually very complicated and therefore carries a heavy computational burden. This limits its value in practical applications such as monitoring driving behavior and sign language recognition. Moreover, the deep CNNs are used as black boxes in the existing methods. Different from hand-crafted features, it is difficult to know the meaning of the features extracted by CNNs. As a result, the stability and robustness of these methods cannot be guaranteed.

In view of the issues mentioned above, we propose an interpretable framework, the Pixel-wise Hand Detection Network (PHDN), to detect hands more efficiently. The proposed method achieves better performance with faster computational speed. An explainable module named the Highlight Feature Fusion (HFF) block is developed to obtain more discriminative features. With the HFF block, PHDN performs effectively and stably in different image contexts. To the best of our knowledge, this is the first work to give reasonable explanations of the learned features in the hand detection procedure. A popular deep convolutional neural network, VGG16 [13] or ResNet50 [14], is adopted as the backbone network in PHDN. The HFF block makes full use of multi-scale features by weighting the lower-level features with the higher-level features. In this way, the discriminative features, namely the ones effective for locating the hand, are highlighted in the detection procedure. Each HFF block fuses features from two layers: it first weights the lower-level features by the higher-level feature maps and then fuses the features by convolution operations. Several HFF blocks are connected in cascade mode (see Fig. 1(b)) to iteratively fuse multi-scale features, which greatly reduces computational overhead and saves time compared to the serial connection (see Fig. 1(a)). As PHDN makes hand region predictions with multi-scale features, it is more robust to hands of different sizes. In other words, our model is scale-invariant.

As for rotated hand detection, adding additional rotation and derotation layers [15] makes the network more complicated and thus increases the computational burden and time overhead. We propose the rotation map and the distance map to store the rotation angle and the geometry information of the hand region respectively, which handles rotated hands without increasing the complexity of the network and learns more interpretable representations of angles by recording the angles of pixels directly.

In the training process, we add supervision to each HFF block. Deep supervision of the hidden layers makes the learned features more discriminative and robust, and thus the performance of the detector is better.
The auxiliary losses accelerate the convergence of training in a simple and direct way compared with [16], which accelerates training by constraining the input weight of each neuron to have zero mean and unit norm.

Existing detection methods make predictions for grid cells [17] or default boxes [10], which requires seeking appropriate anchor scales. Alternatively, we predict hand regions at pixel resolution to avoid the adverse effects of improper anchor scale settings, for which we name our model the Pixel-wise Hand Detection Network. Detecting hands at the pixel level also explains which pixels are the basis for its decision, which improves the transparency of the model. The hand regions predicted by PHDN are filtered by Non-Maximum Suppression (NMS) to yield the final detection results.

To evaluate our model, experiments are conducted on two authentic and publicly accessible hand detection datasets, the VIVA hand detection dataset [18] and the Oxford hand detection dataset [8]. Compared with the state-of-the-art methods, our model achieves competitive Average Precision (AP) and Average Recall (AR) on the VIVA dataset with 4.23 times faster detection speed, and obtains a 5.5% AP improvement on the Oxford dataset. Furthermore, we test PHDN on the hand tracking task with the VIVA hand tracking dataset [19], which is a higher-level application scenario of hand detection. We try three tracking-by-detection methods: the SORT tracker [20], the deep SORT tracker [21] and the IOU tracker [22], where PHDN acts as the detector. Experimental results show that any of the aforementioned tracking algorithms based on our detector achieves better results than existing methods. This indicates that PHDN is robust and practicable, as the detector performance plays a crucial role in tracking-by-detection multiple object tracking methods.

Part of this work has been introduced in [23]. The extensions made in this article compared to [23] are as follows: (1) We analyze the interpretability of our model by visualizing the features extracted by the HFF block. This shows the mechanism of the internal layers and demonstrates how our method outperforms the others. (2) We integrate our detector with popular trackers to track hands in videos and achieve state-of-the-art results on the authoritative VIVA hand tracking challenge dataset [19]. (3) We give a more detailed description of our model, including related work in hand detection and multiple hand tracking in vehicles, the network architecture, feature fusion processing, loss functions, and the settings and results of the conducted experiments.

The main contributions of this paper are four-fold:

• We give insight into the interpretability of the hand detection network for the first time. Reasonable explanations for the features activated in the hand detection procedure and for the discriminative features learned by the HFF block are given. The proposed Pixel-wise Hand Detection Network predicts hand regions at pixel resolution rather than for grid cells or default boxes. It gets rid of the adverse effects of inappropriate anchor scales and can detect different sizes of hands by fusing multi-scale features with the cascaded HFF blocks.

• The rotation map is designed to predict hand rotation angles precisely. It learns and represents the angles in an interpretable way with less computational cost.

• Auxiliary losses are added to provide supervision to the hidden layers of the network, leading to faster convergence of the training and higher precision.
• Experiments on the VIVA and Oxford hand detection datasets show that PHDN achieves competitive performance compared with the state-of-the-art methods. Evaluated on the VIVA hand tracking dataset, tracking-by-detection trackers such as the SORT tracker, deep SORT tracker and IOU tracker with the PHDN detector outperform the existing hand tracking methods.

The remainder of this paper is organized as follows. In Section 2, we review the related work in the field. Section 3 gives a detailed description of the proposed method. Section 4 introduces the datasets and experimental setup, and reports and analyzes the results. Finally, concluding remarks are presented in Section 5.

[Figure 2 panels: "Rotation Map (Ours)" vs. "Derotation and Rotation Layers (Deng et al. 2018)".]
Figure 2: Novel and transparent representation of the rotation angle. We use the rotation map to store the rotation angle instead of adding rotation and derotation layers [15] to the network.
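To make the idea of the rotation map concrete, the following is a minimal NumPy sketch of how a per-pixel rotation-angle target could be rasterized; the helper name, the input layout and the use of a boolean mask per hand are illustrative assumptions, not the authors' released code.

```python
import numpy as np

def rasterize_rotation_map(h, w, hand_regions):
    """Toy ground-truth construction for the rotation map.

    hand_regions: list of (mask, theta) pairs, where mask is a boolean (h, w)
    array marking the pixels of one hand region and theta is its rotation
    angle. Every pixel belonging to a hand simply stores that hand's angle,
    so the learned representation can be read off directly per pixel.
    """
    score_map = np.zeros((h, w), dtype=np.float32)
    rotation_map = np.zeros((h, w), dtype=np.float32)
    for mask, theta in hand_regions:
        score_map[mask] = 1.0        # pixel belongs to a hand
        rotation_map[mask] = theta   # angle is stored transparently per pixel
    return score_map, rotation_map
```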
2. Related Work
Current hand detection methods can be divided into two categories. One is based on hand-crafted structured features, such as color, shape and so on. The other is based on features extracted by CNNs. The methods based on hand-crafted features have strong interpretability, but their detection performance is poor due to the limitations of the features. On the contrary, CNN-based methods tend to have good performance but poor interpretability.
Hand detection methods that use human-crafted features usually propose hand regions using features such as skin color, hand shape, and Histograms of Oriented Gradients (HOG) [24]. These features have specific meanings and are easy to understand. The features are then used to train a classifier, such as a Support Vector Machine (SVM) [8], to generate the final detection results. [25] uses skin and hand shape features to detect hands in images: skin areas are extracted first using a skin detector and the hands are separated out using hand contour comparison. However, it may be confusing when distinguishing between a face and a fist since their contours are similar. [8] generates hand region proposals using a hand shape detector, a context-based detector and a skin-based detector; then an SVM classifier, with the score vectors built by the three detectors as input, is trained to classify the hand and non-hand regions. To enhance the robustness of hand detection in cluttered backgrounds, [26] proposes three new features based on HOG, Local Binary Patterns (LBP) and Local Trinary Patterns (LTP) descriptors to train classifiers, but it does not perform well on low-resolution images and cannot handle occlusion well. [7] trains an SVM classifier with HOG features and extends it with a Dynamic Bayesian Network for better performance. Due to the limitations of hand-crafted features, these methods are not robust to changes of illumination, background and hand shape. Moreover, the non-end-to-end optimization process is time-consuming and the performance is often suboptimal.
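As an illustration of this classical pipeline (not the implementation of any specific cited paper), a minimal HOG-plus-SVM hand/non-hand classifier might look as follows; the window size, HOG parameters and the training data are hypothetical placeholders chosen only for the example.

```python
import numpy as np
from skimage.feature import hog
from sklearn.svm import LinearSVC

def hog_descriptor(patch):
    # patch: grayscale image patch resized to a fixed window, e.g. 64x64
    return hog(patch, orientations=9, pixels_per_cell=(8, 8),
               cells_per_block=(2, 2), block_norm='L2-Hys')

def train_hand_classifier(hand_patches, background_patches):
    # hand_patches / background_patches: lists of 64x64 grayscale arrays (assumed)
    X = np.array([hog_descriptor(p) for p in hand_patches + background_patches])
    y = np.array([1] * len(hand_patches) + [0] * len(background_patches))
    clf = LinearSVC(C=1.0).fit(X, y)   # linear SVM over HOG features
    return clf

def classify_window(clf, patch):
    # positive decision value indicates a hand window
    return clf.decision_function([hog_descriptor(patch)])[0]
```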
Inspired by the progress of Convolutional Neural Networks (CNNs), many recently proposed hand detection methods are based on CNNs. [3] presents a lightweight hand proposal generation approach, in which a CNN-based method is used to disambiguate hands in complex egocentric interactions. Context information, such as hand shapes and locations, can be seen as prior knowledge and can be used to train a hand detector [27]. However, there is no doubt that additional context cues over-complicate the image preprocessing step. Inspired by these works, [11] first generates hand region proposals with the Fully Convolutional Network (FCN) [28] and then fuses multi-scale features extracted from the FCN into a large feature map to make the final predictions, as a result of which the convolution operations in the later steps are time-consuming. Similarly, [12] concatenates the multi-scale feature maps from the last three pooling layers into a large feature map. Although different receptive fields are taken into consideration, simple concatenation of feature maps results in a high computational cost.

In contrast to human-crafted features, the features extracted by CNNs are not interpretable and thus the rationality and validity of the model are difficult to verify. In order to provide interpretability to CNN-based hand detection models, we detect hands at the pixel level. For any pixel in the image, we predict whether it belongs to a hand and the bounding box of that hand. In this way, we can know the basis on which the model makes its predictions. Given that high-level feature maps reflect global features while low-level feature maps contain more local information, the feature maps from different scales are weighted before being merged so that features from multiple scales can complement each other in the subsequent process. In view of the heavy computational burden caused by the fusion of multi-scale information, our model fuses multi-scale features iteratively rather than simultaneously.

Another issue of hand detection is handling rotation. Hands are rarely shown in upright positions in images. To accurately detect hands and estimate their poses, [15] designs a rotation network to predict the rotation angle of region proposals and a derotation layer to obtain axis-aligned rotated feature maps (see Fig. 2). However, the method is of great complexity as it includes two components for rotation, a shared network for learning features and a detection network for the classification task. It is also hard to find out what the rotation and derotation layers really learn. To handle rotated hand samples more effectively, we develop the rotation map to replace the complex rotation and derotation layers, as shown in Fig. 2. It is also more interpretable as each pixel value represents the rotation angle directly. The results on the Oxford hand detection dataset show that the rotation map brings a significant increase (about 0.30) in AP compared to using only the distance maps.
Tracking hands in the vehicle cabin is important for monitoring driving behavior and for research in intelligent vehicles. Although hand tracking has been studied since the last century, there are few studies on tracking multiple hands simultaneously in naturalistic driving conditions. To the best of our knowledge, only [5] has published research results on multiple hand tracking so far. [5] proposes a tracking-by-detection method, where each video frame is first processed by the detector, which is then integrated with a tracker to provide individual tracks online. The ACF detector [29] is used to generate hand detection results and the data association is performed using a bipartite matching algorithm. It reports tracking results on the VIVA hand tracking dataset. To investigate the performance of our model in hand tracking, we apply PHDN to the SORT tracker [20], the deep SORT tracker [21] and the IOU tracker [22]. The SORT tracker and deep SORT tracker are online tracking methods, where only the current and previous frames are visible to the tracker. The SORT tracker performs Kalman filtering in image space and uses the Hungarian method to associate detections across frames in a video sequence. The deep SORT tracker was developed to address the many identity switches of the SORT tracker: it adopts a novel association metric with more motion and appearance information compared to the IOU distance used in the SORT tracker. The reported results show that the deep SORT tracker has fewer identity switches than the SORT tracker. The IOU tracker is an offline tracking method that can generate trajectories with all observations in the video. It associates the detection with the highest IOU to the last detection of a track in previous frames to extend the trajectory. It can run at 100K fps as its complexity is very low. The tracking performance depends largely on the detector. Therefore, we conduct experiments on the VIVA hand tracking dataset with our detector and use the three trackers to evaluate our model in a practical tracking task.
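To make the IOU tracker's association rule concrete, the sketch below shows a simplified greedy IOU association step under our reading of [22]; the threshold value and the data layout are illustrative assumptions, not the reference implementation.

```python
def iou(a, b):
    # boxes given as (x1, y1, x2, y2)
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1]) +
             (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0

def extend_tracks(tracks, detections, sigma_iou=0.5):
    """Greedy IOU association: each active track takes the detection with the
    highest IOU to its last box, provided the IOU exceeds sigma_iou."""
    remaining = list(detections)
    for track in tracks:
        if not remaining:
            break
        best = max(remaining, key=lambda d: iou(track[-1], d))
        if iou(track[-1], best) >= sigma_iou:
            track.append(best)          # extend the trajectory
            remaining.remove(best)
    tracks.extend([[d] for d in remaining])  # unmatched detections start new tracks
    return tracks
```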
3. Interpretable Pixel-wise Hand Detection Network
The PHDN architecture is illustrated in Fig. 3. To show our model more clearly, only the VGG16 backbone is presented in the figure, as its structure is simpler than that of ResNet50. The feature maps from four different scales extracted by the VGG16 or ResNet50 extractor are fused iteratively in the cascaded HFF blocks. The final feature maps, containing multi-scale information, are upsampled and convolved to obtain the score map, the rotation map and the distance map. With these three kinds of maps, we can restore the hand bounding boxes and filter them by NMS to generate the final hand regions. In the following, we describe the pipeline in detail and construct the loss function for training.
Figure 3: PHDN architecture with VGG16 as the backbone. The left part is the feature extracting stem, and the right part is the feature fusion branch and the output layers. The Highlight Feature Fusion (HFF) block is marked with a red dotted rectangle.
We try two popular deep convolutional networks, i.e., VGG16 and ResNet50, to extract features from the images. The models pre-trained on the ImageNet dataset [30] are used in our study. Feature maps from four layers are selected for the feature fusion module. For VGG16, we adopt the feature maps from pooling-2 to pooling-5. Similarly, the outputs of conv2_1, conv3_1, conv4_1 and conv5_1 are extracted from ResNet50. The feature maps extracted from VGG16 or ResNet50 are 1/4, 1/8, 1/16 and 1/32 of the size of the input images, and represent information from different sizes of receptive fields.

Algorithm 1 Feature Fusion Procedure
Input:
    Feature maps extracted by VGG16 or ResNet50, f_s, s ∈ {0, 1, 2, 3};
    Channels of the fused feature maps, c_s, s ∈ {0, 1, 2, 3};
Output:
    Fused feature maps, f'_s, s ∈ {0, 1, 2, 3};
1: f'_3 = f_3;
2: for s from 2 to 0 do
3:     u_{s+1} = Upsampling(f'_{s+1});
4:     masked = f_s ∗ (1 − Convolution(u_{s+1}, 1×1));
5:     Concate = Concatenate(masked, u_{s+1});
6:     Conv_1 = Convolution(Concate, 1×1, c_s);
7:     Conv_2 = Convolution(Conv_1, 3×3, c_s);
8:     f'_s = Conv_2;
9: end for
10: return f'_s, s ∈ {0, 1, 2, 3};

The size of hands varies greatly across different images or even within the same image. Detecting larger hands needs more global information. It is known that the higher the level of the feature maps, the more global the information they present. Hence, multi-scale feature maps should be merged to detect different sizes of hands. We propose to fuse the feature maps from multiple layers in an iterative way to reduce the computational cost, which can be achieved by cascaded feature fusion blocks as shown in Fig. 1(b). To reduce the interference of useless features and learn more discriminative features, we develop the Highlight Feature Fusion (HFF) block to fuse the features from different scales. Fig. 3 displays three cascaded HFF blocks, which are marked with red dotted rectangles. The cascaded HFF blocks perform the fusion as in Algorithm 1.

We generate a mask from the higher-level feature maps to filter the common features in the current-level feature maps, as formulated in Line 4 above, where ∗ denotes element-wise multiplication. Masking f_s with the complementary feature maps of u_{s+1} highlights the fine-grained distinctive information contained in f_s that u_{s+1} may not have. Conv_1 and Conv_2 in Algorithm 1 denote 1×1 and 3×3 convolution operations with c_s output channels. As a baseline, we concatenate f_s and u_{s+1} directly, without the mask, as a Base Feature Fusion (BFF) block in our experiments. We visualize the features extracted by the HFF block and the BFF block to interpret the robustness and effectiveness of the HFF block in Section 4.5.1.
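To make the fusion step concrete, the following PyTorch-style sketch shows one possible implementation of a single HFF step under our reading of Algorithm 1; the single-channel sigmoid mask and the exact channel counts are assumptions made for illustration, not the authors' released code.

```python
import torch
import torch.nn as nn

class HFFBlock(nn.Module):
    """One Highlight Feature Fusion step: mask the lower-level features with
    the complement of a mask predicted from the upsampled higher-level
    features, then fuse with 1x1 and 3x3 convolutions."""
    def __init__(self, low_ch, high_ch, out_ch):
        super().__init__()
        self.upsample = nn.Upsample(scale_factor=2, mode='bilinear',
                                    align_corners=False)
        # assumed: a 1x1 convolution followed by a sigmoid gives the mask
        self.mask_conv = nn.Sequential(nn.Conv2d(high_ch, 1, 1), nn.Sigmoid())
        self.conv1 = nn.Conv2d(low_ch + high_ch, out_ch, 1)
        self.conv2 = nn.Conv2d(out_ch, out_ch, 3, padding=1)

    def forward(self, f_low, f_high):
        u = self.upsample(f_high)                    # u_{s+1}
        masked = f_low * (1.0 - self.mask_conv(u))   # highlight distinctive features
        fused = torch.cat([masked, u], dim=1)        # Concatenate(masked, u_{s+1})
        return self.conv2(self.conv1(fused))         # f'_s

# usage sketch: one fusion stage in the cascade (channel sizes illustrative)
# hff2 = HFFBlock(low_ch=256, high_ch=512, out_ch=128)
# f2_fused = hff2(f2, f3)
```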
For each pixel in the image, we generate the confidence that it belongs to a hand region and the corresponding hand bounding box. In this way, the model can show what its prediction is based on. The following paragraphs elaborate on this process.

After the last HFF block, the feature maps go through a 3×3 convolution layer, and 1×1 convolution layers then produce the score map (one channel), the rotation map (one channel, with angles in [−π/2, π/2]) and the distance map (four channels, corresponding to d_t, d_r, d_b, d_l in Fig. 4).

Figure 4: Restore hand bounding boxes from the rotation map and distance map.

Hand boxes are generated with the rotation map and distance map for pixels whose scores are higher than a given threshold in the score map. An example is given in Fig. 4 to illustrate the restoring process for a pixel p. Based on the distance map we obtain the distances d_t, d_r, d_b, d_l from p to the four boundaries (top, right, bottom, left) of the rectangle R_p. In order to calculate the coordinates of p_0, p_1, p_2, p_3 in the image coordinate system (drawn in black in Fig. 4), an auxiliary coordinate system (drawn in red in Fig. 4) is introduced with the corner p_0 as the origin; the directions of its X-axis and Y-axis are the same as those of the image coordinate system. We rotate R_p to the horizontal, obtaining the axis-aligned rectangle R'_p, and the corresponding position of p in R'_p is denoted as p'. Let (x', y') and (x'_i, y'_i), i ∈ {1, 2, 3}, be the coordinates of p and p_i in the auxiliary coordinate system. For the clockwise rotation of the rectangle R_p, we have

$$M(\theta)\begin{pmatrix}x'\\y'\end{pmatrix}=\begin{pmatrix}d_l\\-d_b\end{pmatrix},\quad M(\theta)\begin{pmatrix}x'_1\\y'_1\end{pmatrix}=\begin{pmatrix}0\\-(d_t+d_b)\end{pmatrix},\quad M(\theta)\begin{pmatrix}x'_2\\y'_2\end{pmatrix}=\begin{pmatrix}d_l+d_r\\-(d_t+d_b)\end{pmatrix},\quad M(\theta)\begin{pmatrix}x'_3\\y'_3\end{pmatrix}=\begin{pmatrix}d_l+d_r\\0\end{pmatrix}, \qquad (1)$$

where M(θ) is the rotation matrix in two-dimensional space, which can be formulated as

$$M(\theta)=\begin{pmatrix}\cos\theta & -\sin\theta\\ \sin\theta & \cos\theta\end{pmatrix}. \qquad (2)$$

θ is the rotation angle with counter-clockwise as the positive direction, and it is restored from the rotation map in our experiments. Finally, the coordinates (x_i, y_i), i ∈ {0, 1, 2, 3}, of p_i in the image coordinate system are calculated by

$$\begin{pmatrix}x_0\\y_0\end{pmatrix}=\begin{pmatrix}x\\y\end{pmatrix}-\begin{pmatrix}x'\\y'\end{pmatrix},\qquad \begin{pmatrix}x_i\\y_i\end{pmatrix}=\begin{pmatrix}x'_i\\y'_i\end{pmatrix}+\begin{pmatrix}x_0\\y_0\end{pmatrix},\quad i\in\{1,2,3\}, \qquad (3)$$

where (x, y) are the coordinates of p in the image coordinate system. According to Eqs. (1)-(3), the hand bounding box R_p = {(x_i, y_i) | i ∈ {0, 1, 2, 3}} corresponding to pixel p can be restored from the rotation map and distance map.

Many redundant detection bounding boxes are produced by the network. To generate clean detection results, we use NMS to filter out the boxes with low scores and high overlap rates.
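For illustration, a NumPy sketch of this restoration step is given below, following our reconstruction of Eqs. (1)-(3); the image coordinate convention (x to the right, y downward) and the corner ordering are assumptions made for the example.

```python
import numpy as np

def restore_box(px, py, d, theta):
    """Restore the four corners of a rotated hand box for pixel (px, py).

    d = (d_t, d_r, d_b, d_l): distances from the pixel to the top, right,
    bottom and left sides of the derotated box; theta: rotation angle
    (counter-clockwise positive), as read from the rotation map.
    """
    d_t, d_r, d_b, d_l = d
    M = np.array([[np.cos(theta), -np.sin(theta)],
                  [np.sin(theta),  np.cos(theta)]])
    M_inv = M.T  # the inverse of a rotation matrix is its transpose

    # axis-aligned offsets: right-hand sides of Eq. (1)
    p_offset  = np.array([d_l, -d_b])                  # pixel w.r.t. corner p0
    c_offsets = np.array([[0.0,        -(d_t + d_b)],  # p1
                          [d_l + d_r,  -(d_t + d_b)],  # p2
                          [d_l + d_r,   0.0]])         # p3

    # solve Eq. (1) for the primed (auxiliary-frame) coordinates
    xp, yp = M_inv @ p_offset
    corners_rel = (M_inv @ c_offsets.T).T

    p0 = np.array([px - xp, py - yp])                  # Eq. (3), first part
    return np.vstack([p0, corners_rel + p0])           # p0, p1, p2, p3
```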
The detection loss function usually includes a confidence loss and a location loss. Specific to our method, the confidence loss is calculated with the score map, and the location loss consists of the rotation loss and the geometry loss, related to the rotation map and the distance map respectively. To learn a more discriminative mask in the HFF blocks, deep supervision is added to the intermediate HFF blocks with auxiliary losses L_s, s = 1, 2, 3, in addition to the loss L_0 for the output. The overall objective loss function is formulated as

$$L=\sum_{s\in S} w_s L_s, \qquad (4)$$

where S = {0, 1, 2, 3} represents the scale indices of the HFF blocks as shown in Fig. 3 and the parameter w_s adjusts the weight of the corresponding scale. For scale s, the loss L_s is a weighted sum of the losses for the score map L^[s]_sco, the rotation map L^[s]_rot and the distance map L^[s]_dis:

$$L_s=\alpha L^{[s]}_{sco}+\beta L^{[s]}_{rot}+L^{[s]}_{dis}. \qquad (5)$$

The factors α and β control the weights of the three loss terms. We describe these three parts of the loss in detail below.

Regarding the score map as a segmentation of the input image, we use the Dice Similarity Coefficient (DSC) [32] to construct the loss for the score map. The DSC measures the similarity between two contour regions. Let P and G be the point sets of two contour regions; then the DSC is defined as
$$DSC(P,G)=\frac{2\,|P\cap G|}{|P|+|G|}. \qquad (6)$$

|P| (|G|) represents the number of elements in the set P (G). As the ground truth of the score map is a binary mask, the dice coefficient can be written as

$$DSC(P,G)=\frac{2\sum_{i=1}^{N}p_i g_i}{\sum_{i=1}^{N}p_i+\sum_{i=1}^{N}g_i}, \qquad (7)$$

where the sums run over all N pixels of the score map, p_i is the i-th pixel in the score map P generated by the detection network, and g_i is the corresponding pixel in the ground truth map G. Based on the dice similarity coefficient, the dice loss has been proposed and proved to perform well in segmentation tasks [33, 32, 34]. Motivated by this strategy, the loss for the score map is formulated as

$$L_{sco}=1-\frac{2\sum_{i=1}^{N}p_i g_i+\varepsilon}{\sum_{i=1}^{N}p_i+\sum_{i=1}^{N}g_i+\varepsilon}, \qquad (8)$$

where ε is a smoothing term.

The rotation map stores the predicted rotation angles of the corresponding pixels in the input image. The cosine function is adopted to evaluate the distance between the predicted angle θ̃_i and the ground truth θ_i. Consequently, the loss of the rotation map is calculated as

$$L_{rot}=1-\frac{1}{N}\sum_{i=1}^{N}\cos\!\left(\tilde{\theta}_i-\theta_i\right). \qquad (9)$$

As for the regression of the object bounding box, the ℓ2 loss [35] treats the four distances d_t, d_r, d_b, d_l as independent variables, which may mislead the training when only one or two bounds of the predicted box are close to the ground truth. To avoid this, [36] proposes the IoU loss, which treats the four distances as a whole. Besides, the IoU loss can handle bounding boxes at various scales, as it uses the IoU to normalize the four distances to [0, 1]:

$$\begin{aligned}
L_{dis} &= -\frac{1}{N}\sum_{i=1}^{N}\ln\frac{I^{[i]}+\varepsilon}{U^{[i]}+\varepsilon},\\
I^{[i]} &= I^{[i]}_h \ast I^{[i]}_w,\\
I^{[i]}_h &= \min(d_t,\tilde{d}_t)+\min(d_b,\tilde{d}_b),\\
I^{[i]}_w &= \min(d_l,\tilde{d}_l)+\min(d_r,\tilde{d}_r),\\
U^{[i]} &= X^{[i]}+\tilde{X}^{[i]}-I^{[i]},\\
X^{[i]} &= (d_t+d_b)\ast(d_l+d_r),\qquad \tilde{X}^{[i]}=(\tilde{d}_t+\tilde{d}_b)\ast(\tilde{d}_l+\tilde{d}_r),
\end{aligned} \qquad (10)$$

where N is the number of pixels in the distance map and ε is a smoothing term. I^[i] and U^[i] denote the intersection and union of the predicted box {d̃_t, d̃_r, d̃_b, d̃_l} and the ground truth {d_t, d_r, d_b, d_l}, respectively.
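As an illustration of how these three terms could be computed jointly, here is a hedged PyTorch sketch of the per-scale loss in Eq. (5); the tensor shapes, the smoothing constant, and the choice to supervise the geometry terms only at ground-truth hand pixels are assumptions made for the example.

```python
import torch

def phdn_scale_loss(score, score_gt, theta, theta_gt, dist, dist_gt,
                    alpha=0.01, beta=20.0, eps=1e-5):
    """score, theta: (B,1,H,W) predicted score and rotation maps;
    dist: (B,4,H,W) predicted distances (d_t, d_r, d_b, d_l); *_gt: ground truth."""
    # dice loss for the score map, Eq. (8)
    inter = (score * score_gt).sum()
    l_sco = 1.0 - (2.0 * inter + eps) / (score.sum() + score_gt.sum() + eps)

    # cosine distance for the rotation map, Eq. (9), averaged over hand pixels
    mask = score_gt                       # assumed: geometry supervised on hand pixels
    n = mask.sum().clamp(min=1.0)
    l_rot = ((1.0 - torch.cos(theta - theta_gt)) * mask).sum() / n

    # IoU loss for the distance map, Eq. (10)
    d_t, d_r, d_b, d_l = dist.unbind(dim=1)
    g_t, g_r, g_b, g_l = dist_gt.unbind(dim=1)
    i_h = torch.min(d_t, g_t) + torch.min(d_b, g_b)
    i_w = torch.min(d_l, g_l) + torch.min(d_r, g_r)
    inter_area = i_h * i_w
    union = (d_t + d_b) * (d_l + d_r) + (g_t + g_b) * (g_l + g_r) - inter_area
    iou = (inter_area + eps) / (union + eps)
    l_dis = (-torch.log(iou) * mask.squeeze(1)).sum() / n

    return alpha * l_sco + beta * l_rot + l_dis   # Eq. (5)
```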
4. Experiments
We evaluate our detector on three benchmark datasets: the VIVA hand detection dataset [18], the Oxford hand detection dataset [8] and the VIVA hand tracking dataset [19].

4.1. Experimental Settings
All experiments are conducted on an Intel(R) Core(TM) i7-6700K @ 4.00GHz CPU with a single GeForce GTX 1080 GPU. We try two backbone networks, VGG16 [13] and ResNet50 [14], for feature extraction and use the models pre-trained on ImageNet [30]. We employ the network with the Base Feature Fusion (BFF) block as our base model and conduct ablation experiments to evaluate the performance of the Highlight Feature Fusion (HFF) block and the auxiliary losses.

Training is implemented with a stochastic gradient algorithm using the ADAM scheme. We adopt an exponentially decaying learning rate, which decays with rate 0.94 every few thousand iterations. The weight parameters w_s, s ∈ {0, 1, 2, 3}, are all set to 1 by default. The hyper-parameters α and β are set to 0.01 and 20, respectively. Besides, the score map threshold is set to 0.8; in other words, all pixels that obtain scores higher than 0.8 are used to restore hand bounding boxes. The input images are resized to 512 × 512 before being fed into the network in training. When predicting on the test dataset, the original size of the input image is preserved, as the network is fully convolutional and allows arbitrary sizes of input images.
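A minimal optimizer and schedule sketch consistent with this description might look as follows; the initial learning rate and the decay interval are placeholders, since the paper's exact values are not fully legible here.

```python
import torch

def make_optimizer(model, base_lr=1e-4, decay_every=10000, decay_rate=0.94):
    # Adam with an exponentially decaying learning rate (rate 0.94), stepped
    # every `decay_every` iterations; base_lr and decay_every are placeholders.
    optimizer = torch.optim.Adam(model.parameters(), lr=base_lr)
    scheduler = torch.optim.lr_scheduler.StepLR(
        optimizer, step_size=decay_every, gamma=decay_rate)
    return optimizer, scheduler
```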
VIVA Hand Detection Dataset. This dataset is published by the Vision for Intelligent Vehicles and Applications Challenge [18] for the hand detection subtask. It includes 5,500 training and 5,500 testing images. The images are collected from 54 videos captured in naturalistic driving scenarios, with 7 possible viewpoints. Annotations for the images are publicly accessible.

Table 1: Results on the VIVA Hand Detection Dataset

| Methods | Level-1 (AP/AR)/% | Level-2 (AP/AR)/% | Speed/fps | Environment |
| MS-RFCN [11] | -/77.3 | - | 4.65 | - |
| Multi-scale fast RCNN [12] | 92.8/82.8 | 84.7/66.5 | 3.33 | 6-core CPU, 64GB RAM, Titan X GPU |
| FRCNN [27] | 90.7/55.9 | 86.5/53.3 | - | - |
| YOLO [17] | 76.4/46.0 | 69.5/39.1 | 35.00 | 6-core CPU, 16GB RAM, Titan X GPU |
| ACF Depth4 [18] | 70.1/53.8 | 60.1/40.4 | - | - |
| Ours (VGG16+BFF) | 88.9/82.8 | 72.6/56.7 | 13.88 | 4 cores @ 4.00GHz, 32GB RAM, GeForce GTX 1080 |
| Ours (VGG16+BFF+Auxiliary Losses) | 92.9/88.3 | 80.9/62.7 | 13.16 | |
| Ours (VGG16+HFF+Auxiliary Losses) | 92.3/89.1 | 83.6/68.8 | 13.10 | |
| Ours (ResNet50+BFF) | 93.7/89.9 | 83.6/73.6 | 20.40 | |
| Ours (ResNet50+BFF+Auxiliary Losses) | 94.0/90.1 | 85.7/74.0 | 20.00 | |
| Ours (ResNet50+HFF+Auxiliary Losses) | -/- | -/- | 19.68 | |

The bounding boxes of hand regions in an image are given by (x, y, w, h) in the .txt-format annotation file, where x, y are the upper-left coordinates of the box and w, h are its width and height, respectively. As the given annotations are axis-aligned, the rotation angles are set to 0 in training and the predictions are axis-aligned bounding boxes in our experiments on this dataset.

We evaluate the algorithms on two levels according to the size of the hand instances, using the evaluation kit provided by the Vision for Intelligent Vehicles and Applications Challenge. Level-1 focuses on the hand instances with a minimum height of 70 pixels, only over the shoulder (back) camera view, while
Level-2 evaluates hand samples with a minimum height of 25 pixels in all camera views. Evaluation metrics include the Average Precision (AP) and Average Recall (AR). AP is the area under the Precision-Recall curve, and AR is calculated over 9 evenly sampled points in log space between $10^{-2}$ and $10^{0}$ false positives per image. As in PASCAL VOC [38], the hit/miss threshold on the overlap between a pair of predicted and ground truth bounding boxes is set to 0.5.
Figure 5: Precision-Recall curves ((a) Level-1, (b) Level-2) and ROC curves ((c) Level-1, (d) Level-2; logarithmic scale for the x-axis) on the VIVA dataset, comparing Ours, MS-FRCNN, FRCNN, ACF Depth4 and YOLO.
The Precision-Recall curves and ROC curves of these methods and of our model (ResNet50+HFF+Auxiliary Losses) are shown in Fig. 5. Our model achieves 92.3%/89.1% (AP/AR) at Level-1 and 83.6%/68.8% (AP/AR) at Level-2 using VGG16 as the backbone network. The ResNet50-based PHDN network obtains even more accurate performance at both levels (see Table 1).

Apart from accuracy, the detection speed is also an important metric. As we can see in Table 1, YOLO [17] performs hand detection in real time, but its accuracy is unsatisfactory. On the contrary, MS-RFCN [11] outperforms the other detectors in accuracy, but its detection speed is very slow, i.e., 4.65 fps. With our PHDN based on VGG16 and ResNet50, the detection speeds reach 13.10 and 19.68 fps, respectively. The model (ResNet50+HFF+Auxiliary Losses) obtains competitive accuracy with a 4.23 times faster running speed compared to [11]. Therefore, it is of great significance that our model achieves a good trade-off between accuracy and speed.
Table 2: Results on the Oxford Hand Detection Dataset

| Methods | AP/% |
| MS-RFCN [11] | - |
| Multiple proposals [8] | 48.2 |
| Multi-scale fast CNN [12] | 58.4 |
| Ours (VGG16+BFF) | 68.7 |
| Ours (VGG16+BFF+Auxiliary Losses) | 77.8 |
| Ours (VGG16+HFF+Auxiliary Losses) | 78.0 |
| Ours (ResNet50+BFF) | 78.2 |
| Ours (ResNet50+BFF+Auxiliary Losses) | 78.6 |
| Ours (ResNet50+HFF+Auxiliary Losses) | - |
Figure 6: Precision-Recall curve and ROC curve on the Oxford dataset.
Oxford Hand Detection Dataset. This dataset consists of three parts: the training set, the validation set and the testing set. Each hand instance is annotated with the four corners (x_i, y_i), i ∈ {0, 1, 2, 3}, of its bounding box in the .mat format; the boxes are not necessarily axis-aligned but are oriented with respect to the wrist. The rotation angle is therefore calculated additionally in our experiments.

According to the official evaluation protocol of the Oxford dataset, we report the performance on all the "bigger" hand instances, i.e., those with more than 1,500 pixels. Our model achieves an improvement of 5.5% in AP score compared with the state-of-the-art MS-RFCN [11]. The VGG16-based PHDN still outperforms MS-RFCN [11] by 2.9% in AP score. The Precision-Recall curve and ROC curve are presented in Fig. 6. In addition, it is worth mentioning that the detection speed on the Oxford dataset reaches more than 62 fps.

VIVA Hand Tracking Dataset. This dataset is built by the Vision for Intelligent Vehicles and Applications Challenge for the hand tracking sub-contest. It contains 27 training and 29 test sequences captured under naturalistic driving conditions, and 2D bounding box annotations of hands are provided as {frame, id, bb_left, bb_top, bb_width, bb_height}. Evaluation metrics [5] follow the standard multiple object tracking protocol and are listed as follows.
• MOTA (Multiple Object Tracking Accuracy): a comprehensive metric combining the false negatives, false positives and mismatch rate.
• MOTP (Multiple Object Tracking Precision): overlap between the estimated positions and the ground truth, averaged over all matches.
• Recall: ratio of correctly matched detections to ground truth detections.
• Precision: ratio of correctly matched detections to total output detections.
• MT (Mostly Tracked): percentage of ground truth trajectories that are covered by the tracker output for more than 80% of their length.
• ML (Mostly Lost): percentage of ground truth trajectories that are covered by the tracker output for less than 20% of their length.
• IDS (ID Switches): number of times that a tracked trajectory changes its matched ground truth identity.
• FRAG (Fragments): number of times that a ground truth trajectory is interrupted in the tracking result.

For MOTA, MOTP, Recall, Precision and MT, greater values mean better performance, whereas for ML, IDS and FRAG, smaller values are better.

To evaluate our detector, we employ the SORT tracker [20], the deep SORT tracker [21] and the IOU tracker [22] to associate our detection results and extend trajectories on the VIVA hand tracking dataset. The results are reported in Table 3. The model (ResNet50+HFF+Auxiliary Losses) is used to generate the detection results. Note that we present the Recall and Precision of our method as they are the metrics concerned with the detection performance in multiple object tracking. Our model (ResNet50+HFF+Auxiliary Losses) performs much better than the existing methods on this dataset, which indicates that our detector is practicable and performs well in the hand tracking task.

Table 3: Results on the VIVA Hand Tracking Dataset

| Methods | MOTA/% | MOTP/% | Recall/% | Precision/% | MT/% | ML/% | IDS | FRAG |
| TDC(CNN) [5] | 25.1 | 64.6 | - | - | 39.1 | 18.8 | 34 | 415 |
| TDC(HOG) [5] | 24.6 | 64.5 | - | - | 35.9 | 17.2 | 39 | 426 |
| Ours+SORT | 83.4 | - | - | - | - | - | - | - |
| Ours+Deep SORT | - | - | - | - | - | - | - | - |
Figure 7: The change of AP with α and β on the Oxford dataset.

Figure 8: Visual explanations for predictions. (a) PHDN with ResNet50 and the Base Feature Fusion (BFF) block; (b) PHDN with ResNet50 and the Highlight Feature Fusion (HFF) block. The heatmap in the blue-yellow-red color scale is added to the original image to show the activated regions.
Ablation experiments are conducted to study the effect of different aspects of our model on the detection performance. We choose ResNet50 as the default backbone network and the Oxford hand detection dataset for the further analysis of our model.

4.5.1. Interpretable and Robust HFF Block
Some visual explanations for the effectiveness and robustness of the HFF block are given in Fig. 8. The activation feature map is converted into a blue-yellow-red color scale and then overlaid on the original input image to show which pixels are activated in the detection procedure. We can see that the HFF block is good at locating discriminative pixels compared with the BFF block. The HFF block avoids confusing parts such as faces and feet. It can also activate the hand pixels accurately even against cluttered backgrounds, as shown in the second example in Fig. 8(b). The HFF block uses the mask to filter the redundant features of the corresponding layer, while the BFF block does not.

From Tables 1 and 2, we can see that the HFF block outperforms the BFF block whether VGG16 or ResNet50 is used as the backbone. Specifically, with VGG16 as the backbone and evaluated at Level-2, the HFF block achieves an improvement of 2.7% in AP and 6.1% in AR on the VIVA hand detection dataset; with ResNet50, the improvements are 0.6% in AP and 1.8% in AR, respectively. The AR score is improved greatly, which indicates that the model with the HFF block produces fewer false negatives than the BFF block and makes better use of the distinctive features of different scales. The HFF block also shows better performance on the Oxford dataset: it gains an improvement of 0.2% in AP score with VGG16 and 2.0% with ResNet50 compared to the BFF block.
We adjust the value of α in Eq. (5) to find an appropriate weight for the score map loss in training. The results are reported in Fig. 7(a). As α increases from 0.01 to 1, the AP first increases and then decreases; it reaches its maximum when α takes 0.10 in our experiments. As we can see, if the classification loss is weighted too highly, the AP score declines.

Figure 9: Training time and AP score vs. different numbers of scales on the Oxford dataset.

The rotation map helps to further locate the hand more accurately. To investigate the role it plays in the detection, we control the weight of the rotation map in the training process by changing β in Eq. (5). We first set β to 0, i.e., ignore the rotation map in training, to obtain detection results. Then we try four different values (1, 5, 10 and 20) for β to train models and evaluate all the detection results on the Oxford test set. The AP score and the corresponding β are plotted in Fig. 7(b). When the rotation angle is considered in the optimization procedure, i.e., β > 0, the AP score stays above 0.78 for all the values of β tried in our experiments. Otherwise, when β is set to 0, there is a significant drop in the AP score. Therefore, the rotation map plays a very important role in optimizing the final model and can improve the locating accuracy greatly.

In order to investigate the effectiveness of the auxiliary losses, we train models considering different numbers of scales. The variation of training time and AP score with the number of supervision scales is shown in Fig. 9. The numbers of scales 1, 2, 3 and 4 correspond to S = {0}, S = {0, 1}, S = {0, 1, 2} and S = {0, 1, 2, 3} in Eq. (4), respectively. From Fig. 9, we can see that the time it takes for the model to converge decreases as the number of scales used in the loss function increases.
Figure 10: Detection results visualization. Annotations of VIVA hand detection dataset andVIVA hand tracking dataset are horizontal bounding boxes. Images in Oxford hand detectiondataset are labeled with wrist-oriented boxes. (a) (b)(a) Examples from VIVA hand detection dataset (b) Examples from Oxford hand detection dataset (c) Manual annotations for VIVA hand tracking dataset (d) Tracking results by our detector with SORT tracker on
VIVA hand tracking dataset
Figure 11: Detection results comparisons. (a) and (b) compare the performance between ourPHDN based on ResNet50 model (cyan bounding boxes) and Multi-scale fast RCNN [12] (redbounding boxes). (c) and (d) show the ground truth and our tracking results on the VIVAhand tracking dataset. creases. The convergence of the network is accelerated significantly (more than10 hours) by adding auxiliary losses into the total loss. At the same time, theAP score is stable regardless of the number of scales. It can be concluded thatthe auxiliary losses accelerate the training process without sacrificing the APscore. This is attributed to the multiple supervision to the intermediate layersof the network. 27 a) Examples from VIVA hand detection dataset(a) Examples from VIVA hand detection dataset (b) Examples from Oxford hand detection dataset (c) Human annotations for VIVA hand tracking dataset (d) Tracking results by our detector with SORT tracker on
VIVA hand tracking dataset (b) Examples from Oxford hand detection dataset (c) Examples from VIVA hand tracking dataset(a) Examples from VIVA hand detection dataset (b) Examples from Oxford hand detection dataset (c) Examples from VIVA hand tracking dataset(a) Examples from VIVA hand detection dataset (b) Examples from Oxford hand detection dataset (c) Examples from VIVA hand tracking dataset
Figure 12: Incorrectly detection examples using PHDN model with ResNet50 as backbone.
We show several qualitative detection examples in Fig. 10. As these resultsshow, our model can handle different scales of hands and shapes in various illu-mination conditions, even the blurred samples. Fig. 11 compares our detectionresults with Multi-scale fast RCNN and shows the tracking results and the cor-responding ground truth on the VIVA hand tracking dataset. We can see thatour model achieves fewer false positives and produces more accurate hand loca-tions compared with the visualization results given in [12]. Besides, the modeltrained with rotated hand labels on the Oxford dataset is capable to predicthand rotation angle precisely. Further, applied into the hand tracking task, ourmodel generates satisfactory trajectories as we can see in Fig. 11. Fig. 12 showssome false detected samples. The false detections can be divided into threetypes: (1) When the color or shape of the hand is very close to the background,it may mislead the model to make false predictions or result in missed detec-tion. (2) The faces and feet with confusing colors and shapes are incorrectlydetected as hand regions by the model. (3) Heavy occlusions cause missed de-tection, e.g. , the hand obscured by the toy is not recognized in Fig. 12(b). Ourmodel does not perform well in these situations possibly because the contextinformation, such as surroundings and similar hand color or shape objects, is28ot thoroughly mined and integrated effectively. We will investigate the effectof context information in future work and try to address these issues.
5. Conclusion
Existing hand detection neural networks are ”black box” models and peoplecannot understand how they make automated predictions. This hinders theirapplication in areas such as driving monitoring. In this paper, we present theinterpretable Pixel-wise Hand Detection Network (PHDN). To the best of ourknowledge, this is the first study towards interpretable hand detection. Thepixel-wise prediction shows the basis of detection and provides the model in-terpretability. Features from multiple layers are fused iteratively with cascadedHighlight Feature Fusion (HFF) blocks. This allows our model to learn bet-ter representations while reducing computation overhead. The proposed HFFblock outperforms the Base Feature Fusion (BFF) block and improves the de-tection performance significantly. To gain insight into the reasonability of theHFF block, we visualize regions activated by the HFF block and BFF block re-spectively. The visualization results demonstrate that the HFF block highlightsthe distinctive features of different scales and learns more discriminative ones toachieve better performance. Complex and non-transparent rotation and derota-tion layers are replaced by the rotation map to handle the rotated hand samples.The rotation map is interpretable because it directly records the rotation anglesof pixels as features. It makes the model more transparent. In addition, deepsupervision is added with auxiliary losses to accelerate the training procedure.Compared with the state-of-the-art methods, our algorithm shows competitiveaccuracy and runs a 4 .
23 times faster speed on the VIVA hand detection datasetand achieves an improvement of 5 .
5% in average precision at a speed of 62 . ReferencesReferences [1] Z. Zhang, Y. Xie, F. Xing, M. McGough, L. Yang, Mdnet: A semanticallyand visually interpretable medical image diagnosis network, in: Proceedingsof the IEEE conference on computer vision and pattern recognition, 2017,pp. 6428–6436. doi:10.1109/CVPR.2017.378 .[2] G. Montavon, S. Lapuschkin, A. Binder, W. Samek, K.-R. M¨uller, Ex-plaining nonlinear classification decisions with deep taylor decomposition,Pattern Recognition 65 (2017) 211–222. doi:10.1016/j.patcog.2016.11.008 .[3] S. Bambach, S. Lee, D. J. Crandall, C. Yu, Lending a hand: Detectinghands and recognizing activities in complex egocentric interactions, in: Pro-ceedings of IEEE International Conference on Computer Vision, 2015, pp.1949–1957. doi:10.1109/ICCV.2015.226 .[4] T. Horberry, J. Anderson, M. A. Regan, T. J. Triggs, J. Brown, Driverdistraction: The effects of concurrent in-vehicle tasks, road environmentcomplexity and age on driving performance, Accident Analysis & Preven-tion 38 (1) (2006) 185–191. doi:10.1016/j.aap.2005.09.007 .[5] A. Rangesh, E. Ohn-Bar, M. M. Trivedi, Long-term multi-cue tracking ofhands in vehicles, IEEE Transactions on Intelligent Transportation Systems17 (5) (2016) 1483–1492. doi:10.1109/TITS.2015.2508722 .[6] P. Kakumanu, S. Makrogiannis, N. Bourbakis, A survey of skin-color mod-eling and detection methods, Pattern Recognition 40 (3) (2007) 1106–1122. doi:10.1016/j.patcog.2006.06.010 .307] A. Betancourt, P. Morerio, E. I. Barakova, L. Marcenaro, M. Rauterberg,C. S. Regazzoni, A dynamic approach and a new dataset for hand-detectionin first person vision, in: Proceedings of International Conference ComputerAnalysis of Images and Patterns, Springer, 2015, pp. 274–287. doi:10.1007/978-3-319-23192-1_23 .[8] A. Mittal, A. Zisserman, P. Torr, Hand detection using multiple proposals,in: Proceedings of British Machine Vision Conference, 2011, pp. 75.1–75.11.[9] R. Girshick, J. Donahue, T. Darrell, J. Malik, Region-based convolutionalnetworks for accurate object detection and segmentation, IEEE Transac-tions on Pattern Analysis & Machine Intelligence 38 (1) (2016) 142–158. doi:10.1109/TPAMI.2015.2437384 .[10] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C. Y. Fu, A. C. Berg,Ssd: Single shot multibox detector, in: Proceedings of European conferenceon computer vision, 2016, pp. 21–37. doi:10.1007/978-3-319-46448-0_2 .[11] T. H. N. Le, K. G. Quach, C. Zhu, N. D. Chi, K. Luu, M. Savvides, Robusthand detection and classification in vehicles and in the wild, in: Proceedingsof IEEE International Conference on Computer Vision & Pattern Recogni-tion Workshops, 2017, pp. 1203–1210. doi:10.1109/CVPRW.2017.159 .[12] S. Yan, Y. Xia, J. S. Smith, W. Lu, B. Zhang, Multiscale convolutionalneural networks for hand detection, Applied Computational Intelligenceand Soft Computing 2017.[13] K. Simonyan, A. Zisserman, Very deep convolutional networks for large-scale image recognition, arXiv preprint arXiv:1409.1556.[14] K. He, X. Zhang, S. Ren, J. Sun, Deep residual learning for image recogni-tion, in: Proceedings of IEEE International Conference on Computer Vision& Pattern Recognition, 2016, pp. 770–778. doi:10.1109/CVPR.2016.90 .3115] X. Deng, Y. Yuan, Y. Zhang, P. Tan, L. Chang, S. Yang, H. Wang, Jointhand detection and rotation estimation by using cnn, IEEE Transactionson Image Processing 27 (99). doi:10.1109/TIP.2017.2779600 .[16] L. Huang, X. Liu, Y. Liu, B. Lang, D. 
Tao, Centered weight normalizationin accelerating training of deep neural networks, in: Proceedings of theIEEE International Conference on Computer Vision, 2017, pp. 2803–2811. doi:10.1109/ICCV.2017.305 .[17] J. Redmon, S. Divvala, R. Girshick, A. Farhadi, You only look once: Uni-fied, real-time object detection, in: Proceedings of IEEE International Con-ference on Computer Vision & Pattern Recognition, 2016, pp. 779–788. doi:10.1109/CVPR.2016.91 .[18] N. Das, E. Ohn-Bar, M. M. Trivedi, On performance evaluation of driverhand detection algorithms: Challenges, dataset, and metrics, in: Proceed-ings of IEEE International Conference on Intelligent Transportation Sys-tems, 2015, pp. 2953–2958. doi:10.1109/ITSC.2015.473 .[19] Vision for intelligent vehicles and applications (VIVA).URL http://cvrr.ucsd.edu/vivachallenge/ [20] A. Bewley, Z. Ge, L. Ott, F. Ramos, B. Upcroft, Simple online and realtimetracking, in: Proceedings of IEEE International Conference on Image Pro-cessing, IEEE, 2016, pp. 3464–3468. doi:10.1109/ICIP.2016.7533003 .[21] N. Wojke, A. Bewley, D. Paulus, Simple online and realtime tracking with adeep association metric, in: Proceedings of IEEE International Conferenceon Image Processing, IEEE, 2017, pp. 3645–3649.[22] E. Bochinski, V. Eiselein, T. Sikora, High-speed tracking-by-detection with-out using image information, in: Proceedings of IEEE International Con-ference on Advanced Video & Signal Based Surveillance, IEEE, 2017, pp.1–6. doi:10.1109/AVSS.2017.8078516 .3223] D. Liu, D. Du, L. Zhang, T. Luo, Y. Wu, F. Huang, S. Lyu, Scale invariantfully convolutional network: Detecting hands efficiently, in: Proceedings ofAAAI Conference on Artificial Intelligence, 2019.[24] N. Dalal, B. Triggs, Histograms of oriented gradients for human detection,in: Proceedings of IEEE International Conference on Computer Vision &Pattern Recognition, Vol. 1, IEEE Computer Society, 2005, pp. 886–893. doi:10.1109/CVPR.2005.177 .[25] N. H. Dardas, N. D. Georganas, Real-time hand gesture detection andrecognition using bag-of-features and support vector machine techniques,IEEE Transactions on Instrumentation and Measurement 60 (11) (2011)3592–3607. doi:10.1109/tim.2011.2161140 .[26] J. Niu, X. Zhao, M. A. A. Aziz, J. Li, K. Wang, A. Hao, Human handdetection using robust local descriptors, in: Proceedings of IEEE Interna-tional Conference on Multimedia & Expo Workshops, IEEE, 2013, pp. 1–5. doi:10.1109/ICMEW.2013.6618239 .[27] T. Zhou, P. J. Pillai, V. G. Yalla, Hierarchical context-aware hand detectionalgorithm for naturalistic driving, in: Proceedings of IEEE InternationalConference on Intelligent Transportation Systems, 2016, pp. 1291–1297. doi:10.1109/ITSC.2016.7795723 .[28] J. Long, E. Shelhamer, T. Darrell, Fully convolutional networks for se-mantic segmentation, in: Proceedings of IEEE International Conferenceon Computer Vision & Pattern Recognition, 2015, pp. 3431–3440.[29] P. dollr, piotrs computer vision matlab toolbox (pmt).URL http://vision.ucsd.edu/pdollar/toolbox/doc/index.html [30] A. Krizhevsky, I. Sutskever, G. E. Hinton, Imagenet classification with deepconvolutional neural networks, in: Proceedings of International Conferenceon Neural Information Processing Systems, 2012, pp. 1097–1105. doi:10.1145/3065386 . 3331] X. Zhou, C. Yao, H. Wen, Y. Wang, S. Zhou, W. He, J. Liang, EAST: anefficient and accurate scene text detector, in: Proceedings of IEEE Inter-national Conference on Computer Vision & Pattern Recognition, 2017, pp.2642–2651. doi:10.1109/CVPR.2017.283 .[32] F. Milletari, N. 
Navab, S. A. Ahmadi, V-net: Fully convolutional neuralnetworks for volumetric medical image segmentation, in: Proceedings ofInternational Conference on 3d Vision, 2016, pp. 565–571. doi:10.1109/3DV.2016.79 .[33] O. Ronneberger, P. Fischer, T. Brox, U-net: Convolutional networks forbiomedical image segmentation, in: Proceedings of International Confer-ence on Medical Image Computing & Computer-assisted Intervention, 2015,pp. 234–241. doi:10.1007/978-3-319-24574-4_28 .[34] J. Zhang, X. Shen, T. Zhuo, H. Zhou, Brain tumor segmentation basedon refined fully convolutional neural networks with a hierarchical dice loss,arXiv preprint arXiv:1712.09093.[35] L. Huang, Y. Yang, Y. Deng, Y. Yu, Densebox: Unifying landmark local-ization with end to end object detection, arXiv preprint arXiv:1509.04874.[36] J. Yu, Y. Jiang, Z. Wang, Z. Cao, T. Huang, Unitbox: An advanced ob-ject detection network, in: Proceedings of Acm on Multimedia Conference,ACM, 2016, pp. 516–520. doi:10.1145/2964284.2967274 .[37] T. H. N. Le, C. Zhu, Y. Zheng, K. Luu, M. Savvides, Robust hand de-tection in vehicles, in: Proceedings of International Conference on PatternRecognition, 2017, pp. 573–578. doi:10.1109/ICPR.2016.7899695 .[38] M. Everingham, S. M. A. Eslami, L. V. Gool, C. K. I. Williams,J. Winn, A. Zisserman, The pascal visual object classes challenge: A ret-rospective, Int. J. Comput. Vis. 111 (1) (2015) 98–136. doi:10.1007/s11263-014-0733-5 . 3439] A. Geiger, M. Lauer, C. Wojek, C. Stiller, R. Urtasun, 3d traffic scene un-derstanding from movable platforms, IEEE Transactions on Pattern Anal-ysis & Machine Intelligence 36 (5) (2014) 1012–1025. doi:10.1109/tpami.2013.185doi:10.1109/tpami.2013.185