LFFD: A Light and Fast Face Detector for Edge Devices

Yonghao He*, Dezhong Xu*, Lifang Wu, Meng Jian, Shiming Xiang, and Chunhong Pan
Faculty of Information Technology, Beijing University of Technology
National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences
[email protected], [email protected], [email protected], [email protected], {smxiang, chpan}@nlpr.ia.ac.cn

Abstract
Face detection, as a fundamental technology for various applications, is always deployed on edge devices which have limited memory storage and low computing power. This paper introduces a Light and Fast Face Detector (LFFD) for edge devices. The proposed method is anchor-free and belongs to the one-stage category. Specifically, we rethink the importance of the receptive field (RF) and the effective receptive field (ERF) in the background of face detection. Essentially, the RFs of neurons in a certain layer are distributed regularly in the input image, and these RFs are natural "anchors". Combining RF "anchors" and appropriate RF strides, the proposed method can detect a large range of continuous face scales with 100% coverage in theory. The insightful understanding of the relations between the ERF and face scales motivates an efficient backbone for one-stage detection. The backbone is characterized by eight detection branches and common layers, resulting in efficient computation. Comprehensive and extensive experiments are conducted on the popular benchmarks WIDER FACE and FDDB. A new evaluation schema is proposed for application-oriented scenarios. Under the new schema, the proposed method achieves superior accuracy (WIDER FACE Val/Test: Easy 0.910/0.896, Medium 0.881/0.865, Hard 0.780/0.770; FDDB: discontinuous 0.973, continuous 0.724). Multiple hardware platforms are introduced to evaluate the running efficiency. The proposed method obtains fast inference speed (NVIDIA TITAN Xp: 131.45 FPS at 640×480).

* Authors contributed equally.
Table 1. mAP (%) of the top-5 methods on the validation set of WIDER FACE.

Method            Easy    Medium   Hard
ISRN [36]         0.967   0.958    0.909
VIM-FD [39]       0.967   0.957    0.907
DSFD [16]         0.966   0.957    0.904
SRN [3]           0.964   0.952    0.901
PyramidBox [28]   0.961   0.950    0.889
1. Introduction
Face detection is a long-standing problem in computer vision. In practice, it is the prerequisite to some face-related applications, such as face alignment [14] and face recognition [31]. Besides, face detectors are often deployed on edge devices, such as mobile phones, IP cameras and IoT (Internet of Things) sensors. These devices have limited memory storage and low computing power. Under such conditions, face detectors that have high accuracy and fast running speed are in demand.

Current state-of-the-art face detectors have achieved fairly high accuracy on the convincing benchmark WIDER FACE [33] by leveraging pre-trained heavy backbones like VGG16 [27], Resnet50/152 [7] and Densenet121 [10]. We investigate the top-5 methods on WIDER FACE and present their accuracy in Table 1. It can be observed that these methods have similar accuracy with marginal gaps which are hardly perceived in practical applications. It is difficult and impractical to further boost the accuracy by using more complex and heavier backbones. In our view, better balancing accuracy and latency is crucial for applying face detection to more applicable areas.

Face detection has been a fast-growing branch of general object detection in the past decade. The early work of the Viola-Jones face detector [29] proposes a classic detection framework: cascade classifiers with hand-crafted features. One of its well-known followers is aggregate channel features (ACF) [4, 32], which takes advantage of channel features effectively. Although the methods mentioned above can achieve fast running speed, they rely on hand-crafted features and are not trained end-to-end, resulting in less robust detection accuracy.

Recently, convolutional neural network (CNN) based face detectors [36, 39, 16, 3, 28, 13, 30, 34, 9, 38, 40, 20, 37] show great progress, partially owing to the success of the WIDER FACE benchmark. These methods can be roughly divided into two categories: two-stage methods and one-stage methods.
Two-stage methods [13, 30] consist of proposal selection and localization regression, mainly originating from the R-CNN series [6, 5, 26]. One-stage methods [9, 38, 20, 37, 28, 3, 16, 36], in contrast, coherently combine classification and bounding box (bbox) regression, usually achieving anchor-based and multi-scale detection simultaneously. For most one-stage methods, anchor design and the matching strategy are essential components. In order to improve the accuracy, these methods propose more complex modules based on heavy backbones. Although the above methods can achieve state-of-the-art results, they may not properly balance accuracy and latency.

In this paper, we propose a Light and Fast Face Detector (LFFD) for edge devices, considerably balancing both accuracy and running efficiency. The proposed method is inspired by the one-stage and multi-scale object detection method SSD [17], which also enlightens some other face detectors [16, 28, 38]. One of the characteristics of SSD is that pre-defined anchor boxes are manually designed for each detection branch. These boxes always have different sizes and aspect ratios to cover objects of different scales and shapes. Therefore, anchors play an important role in most one-stage detection methods. For some face detectors [38, 40, 28, 16], sophisticated anchor strategies are crucial parts of the contributions. However, anchor-based methods may face three challenges: 1) anchor matching is unable to sufficiently cover all face scales; although this can be relieved, it remains a problem; 2) matching anchors to groundtruth bboxes is determined by thresholding IOU (Intersection over Union); the threshold is set empirically and it is difficult to make a solid investigation of its impact; 3) setting the number of anchors for different scales depends on experience, which may induce sample imbalance and redundant computation.

From our point of view, the RFs of neurons in feature maps are inherent and natural "anchors".
RFs can easily handle the above challenges. Firstly, continuous scales of faces can be predicted within a certain RF size, rather than the discrete scales of anchor-based methods. Secondly, the matching strategy is clear, namely an RF is matched to a groundtruth bbox if and only if its center falls in the groundtruth bbox. Thirdly, the number of RFs is naturally fixed and they are regularly distributed in the input image. What's more, we make a qualitative analysis on pairing face scales and RF sizes by understanding the insights of the ERF, resulting in an efficient backbone with eight detection branches. The backbone only consists of common layers (conv3×3, conv1×1, ReLU and residual connections), resulting in fast inference. The main contributions of this paper are summarized as follows:

• We study the relations of RF, ERF and face detection. The relevant understanding motivates the network design.
• We introduce the RF to overcome the drawbacks of previous anchor-based strategies, resulting in an anchor-free method.
• We propose a new backbone with common layers for accurate and fast face detection.
• Extensive and comprehensive experiments on multiple hardware platforms are conducted on the benchmarks WIDER FACE and FDDB to firmly demonstrate the superiority of the proposed method for edge devices.
2. Related Work
Face detection has attracted a lot of attention for over a decade.
Early works
Early face detectors leverage hand-crafted features and cascade classifiers to detect faces in a sliding-window manner. The Viola-Jones face detector [29] uses AdaBoost with Haar-like features to train face classifiers discriminatively. Subsequently, utilizing more effective hand-crafted features [21, 41, 32] and more powerful classifiers [1, 22] became the mainstream. These methods are not trained end-to-end, treating feature learning and classifier training separately. Although achieving fast running speed, they cannot obtain satisfactory accuracy.
CNN-based methods
Current CNN-based face detectors benefit from two-stage [6, 5, 26] and one-stage [17, 23, 24, 25] general object detection. Both [13] and [30] are based on faster R-CNN [26], adapting the original faster R-CNN to face detection. Zhang et al. [35] propose a cascaded CNN for coarse-to-fine face detection with an inside cascaded structure. Recently, one-stage face detectors are dominant. MTCNN [34] performs face detection in a sliding-window manner and relies on an image pyramid. HR [9] is an advanced version of MTCNN to some extent, also requiring an image pyramid. Image pyramids have some drawbacks, like slow speed and high memory cost. S3FD [38] takes the RF into consideration for detection branch design and proposes an anchor matching strategy to improve the hit rate. In [40], Zhu et al. focus on detecting small faces by proposing a robust anchor generating and matching strategy. It can be concluded that anchor-related strategies are crucial for face detection. Following S3FD [38], PyramidBox [28] enhances the backbone with low-level feature pyramid layers (LFPN) for better multi-scale detection. SSH [20] constructs three detection modules cooperating with context modules for scale-invariant face detection. DSFD [16] is characterized by feature enhance modules, early layer supervision and an improved anchor matching strategy for better initialization. S3FD, PyramidBox, SSH and DSFD use VGG16 as the backbone, leading to big model sizes and inefficient computation. FaceBoxes [37] aims to make the face detector run in real time by rapidly shrinking the spatial size of the input. In detail, it reaches a large stride of 32 after four layers: two convolution layers and two pooling layers. Although the running speed of FaceBoxes is fast, it abandons the detection of small faces, resulting in relatively low accuracy on WIDER FACE. Different from FaceBoxes, our method handles the detection of small faces delicately, achieving fast running speed and large scale coverage at the same time.
It can be observed that the networks used by re-cent state of the art methods tend to become more complexand heavier. In our view, to gain marginal improvement inaccuracy at the cost of running speed is not appropriate forpractical applications.
3. Light and Fast Face Detector
In this section, we first revisit the concept of the RF and its relation to face detection in Sec. 3.1. Then Sec. 3.2 describes the rationality and advantages of using RFs as natural "anchors". Subsequently, the details of the proposed network are depicted in Sec. 3.3. Finally, we present the specifications of network training in Sec. 3.4.
To begin, we give a brief description of the RF and its properties. The RF is a definite area of the input image which affects the activation of the corresponding neuron; it determines the range that a neuron can "see" in the original input. Intuitively, a target object can be well detected with high probability if it is enclosed by a certain RF. In general, neurons in shallow layers have small RFs and those in deeper layers have large RFs. One important property of the RF is that each input pixel contributes differently to the neuron's activation [18]. Specifically, pixels located around the center of the RF have a larger impact, and the impact decreases gradually as pixels move away from the center. This phenomenon is named the effective receptive field (ERF). ERFs inherently exist in neural networks and present a Gaussian-like distribution. Thus, keeping the target object in the middle of the RF is also important. The proposed LFFD benefits from the above observations.

Face detection is a well-known branch of general object detection and has some characteristics of its own. First, big faces are approximately rigid due to their fixed components, such as eyes, noses and mouths. Although there are facial expression changes, hair occlusion and other unconstrained situations, big faces are still distinguishable. Second, tiny or small faces have to be treated differently from big faces. Tiny faces always have unrecognizable appearances (an example is shown in Fig. 1). It is even difficult for humans to make a face/non-face decision by only seeing the facial area of a tiny face, and the same goes for CNN-based classifiers [9]. With more context information, including necks and shoulders, tiny faces become easier to recognize. A detailed discussion can be found in [9].
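As a concrete aid, the RF size and stride of any layer can be computed recursively from the kernel sizes and strides of the layers before it. The helper below is a generic sketch, not code from the paper; as a sanity check, two stride-2 conv3×3 layers followed by six stride-1 conv3×3 layers yield an RF of 55 pixels with stride 4, matching loss branch 1 (layer c8) described in Sec. 3.3.

```python
def receptive_field(layers):
    """Compute the RF size and stride after a stack of conv layers.

    layers: sequence of (kernel_size, stride) pairs, applied in order.
    Returns (rf_size, rf_stride) measured in input pixels.
    """
    rf, jump = 1, 1  # RF size; spacing between adjacent RF centers
    for k, s in layers:
        rf += (k - 1) * jump  # each layer widens the RF by (k - 1) * jump
        jump *= s             # striding multiplies the center spacing
    return rf, jump

# Two stride-2 conv3x3 layers, then six stride-1 conv3x3 layers:
print(receptive_field([(3, 2), (3, 2)] + [(3, 1)] * 6))  # -> (55, 4)
```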
Figure 1. Tiny face detection. The top-left image only contains a face, and the top-right image depicts a face with sufficient context information. It is easy to see that the face becomes more distinguishable as the context information gradually increases. The lower part describes the relation between the RF and the ERF for detecting the tiny face.
Based on the above understandings, faces of different sizes need different RF strategies:

• for tiny/small faces, ERFs have to cover the faces as well as sufficient context information;
• for medium faces, ERFs only have to contain the faces with little context information;
• for large faces, only keeping them inside the RFs is enough.

These strategies guide us to design an effective backbone.

Figure 2. The overall architecture of the proposed network. The backbone has 25 convolution layers and is divided into four parts: tiny part, small part, medium part and large part. Along the backbone, there are eight loss branches which are in charge of detecting faces of different scales. The entire backbone only consists of conv3×3, conv1×1, ReLU and residual connections.
One-stage detectors are mostly characterized by pre-defined bbox anchors. In order to detect different objects, anchors come in multiple aspect ratios and sizes, and are always redundantly defined. In terms of face detection, it is rational to use anchors with a 1:1 aspect ratio since faces are approximately square, which is also mentioned in [38, 37]. The shapes of RFs are also square if the width and height of the kernel are equal. The proposed method regards RFs as natural "anchors". For the neurons in the same layer, their RFs are regularly tiled in the input image. The number and size of RFs are inherently determined once the network is built.

As for the matching strategy, the proposed method uses a straight and concise rule: an RF is matched to a groundtruth bbox if and only if its center falls in the groundtruth bbox, rather than by thresholding IOU. In the typical anchor-based method S3FD [38], Zhang et al. also analyse the influence of ERFs and design anchor augmentation for tiny faces in particular. In spite of improving the anchor hit rate, S3FD induces an anchor imbalance problem (too many anchors for tiny faces) which has to be addressed by additional means. In contrast, the proposed method can achieve 100% face coverage in theory by controlling the RF stride. Besides, RFs with our matching strategy naturally handle continuous face scales. For instance, RFs of 100 pixels are able to predict faces between 20 pixels and 40 pixels. In this way, the anchor imbalance problem is greatly relieved and faces of each scale are treated equally. Based on the above discussion, we do not create any anchors and the proposed method does not really match anchors to groundtruth bboxes. Therefore, the proposed method is anchor-free.

Table 2. Detailed information about the proposed network.
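The center-in-box matching rule above can be sketched as follows. This is an illustrative helper (names and the label convention are ours, not the paper's): an RF center inside exactly one groundtruth box is positive, inside more than one box it is ignored, and otherwise it is negative.

```python
import numpy as np

def match_rf_centers(centers, gt_boxes):
    """Match RF 'anchors' to groundtruth boxes by the center rule.

    centers:  (N, 2) array of RF center (x, y) coordinates.
    gt_boxes: (M, 4) array of boxes as (x1, y1, x2, y2).
    Returns labels of shape (N,): index+1 of the matched box for positives,
    0 for negatives, -1 for ambiguous centers (inside several boxes).
    """
    x, y = centers[:, 0:1], centers[:, 1:2]                   # (N, 1) each
    inside = ((x >= gt_boxes[:, 0]) & (x <= gt_boxes[:, 2]) &
              (y >= gt_boxes[:, 1]) & (y <= gt_boxes[:, 3]))  # (N, M)
    hits = inside.sum(axis=1)
    labels = np.zeros(len(centers), dtype=np.int64)
    labels[hits > 1] = -1                                     # ambiguous: ignore
    one = hits == 1
    labels[one] = inside[one].argmax(axis=1) + 1              # matched box index + 1
    return labels

centers = np.array([[5., 5.], [15., 15.], [30., 30.]])
boxes = np.array([[0., 0., 10., 10.], [12., 12., 20., 20.]])
print(match_rf_centers(centers, boxes))  # [1 2 0]
```

Because no IOU threshold is involved, any face whose interior contains at least one RF center is guaranteed a positive sample, which is exactly the coverage argument made above.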
According to the above analyses, we can design a specialised backbone for face detection. Two factors determine the placement of loss branches: the size and stride of the RFs. The RF size guarantees that the learned features of faces are robust and distinguishable, whereas the stride ensures the 100% coverage. The overall architecture of the proposed network is illustrated in Fig. 2. The proposed method can detect faces larger than 10 pixels (the size of a face is indicated by its longer side), since the WIDER FACE benchmark requires faces of more than 10 pixels to be detected. It can be observed that the proposed backbone is one-stage with four parts. The concrete information about the loss branches can be found in Table 2.

The tiny part has 10 convolution layers. The first two layers downsample the input with stride 4 in total, stride 2 from each. Therefore, the RFs of the other convolution layers in this part are in stride 4. One crucial principle is: downsample the input as quickly as possible while keeping the 100% face coverage. This part has two loss branches. Loss branch 1 stems from c8, whose RF size is 55, for the continuous face scale 10-15. Similarly, loss branch 2 is from c10 with RF size 71 for the continuous face scale 15-20. Obviously, we can make sure that the centers of at least two RFs fall in the smallest face, thus achieving 100% coverage. There is a special case where one center may fall in more than one face at the same time, in which case the corresponding RF is ignored directly. As discussed in Sec. 3.1, tiny faces need more context information and ERFs are smaller than RFs. To this end, we use much larger RFs than the average face scales. The ratios of RF sizes to average face scales are 4.4 and 4.0 for branch 1 and branch 2, respectively. In Table 2, such ratios gradually decrease from 4.4 to 1.3, because larger faces need less context information. In the backbone, all convolution layers have a kernel size of 3×3, whereas the convolution layers in the branches have a kernel size of 1×1.

For the subsequent parts, their first convolution layers accomplish the same downsampling function. In the small part, the RF increasing speed becomes 16, compared to 8 in the tiny part, so it takes fewer convolution layers to reach the targeted RF sizes. The medium part is similar to the small part, having only one branch. At the end of the backbone, the large part has seven convolution layers. These layers easily enlarge the detection scale without much additional computation due to the small feature maps. Three branches stem from this part. Since big faces are much easier to detect, the ratios of RF sizes to average face scales are relatively small.

The proposed method can detect a large range of faces, from 10 pixels to 560 pixels, within one inference. The overall backbone only consists of conv3×3, conv1×1, ReLU and residual connections. The main reason is that these common layers are widely supported and well optimized by popular inference libraries such as CuDNN*, ncnn†, mace‡ and paddle-mobile§. We do not adopt BN [11] as a component due to slower inference speed, although it has become a standard configuration of many networks. We compared the speed of the original backbone and a variant with BN: the original one achieves 7.6 ms per inference while the one with BN needs 8.9 ms, i.e. 17% slower (resolution: 640×480).

* https://developer.nvidia.com/cudnn
† https://github.com/Tencent/ncnn
‡ https://github.com/XiaoMi/mace
§ https://github.com/PaddlePaddle/paddle-mobile

We now describe the training-related details in several aspects.
Dataset and data augmentation.
The proposed method is trained on the training set of the WIDER FACE benchmark [33], including 12,880 images with more than 150,000 valid faces. Faces smaller than 10 pixels are discarded directly. Data augmentation is important for improving robustness. The detailed strategies are listed as follows:

• Color distortion, such as random lighting noise, random contrast and random brightness. More information can be found in [8, 15].
• Random sampling for each scale. In the proposed network, there are eight loss branches, each in charge of a certain continuous scale. Thus, we have to guarantee that: 1) the number of faces for each branch is approximately the same; 2) each face can be sampled for each branch with the same probability. To this end, we first randomly select an image, and then randomly select a face in the image. Second, a continuous face scale is selected and the face is randomly resized within that scale, together with the entire image and the other face bboxes. Finally, we crop a 640×640 sub-image centered at the selected face, filling the outer space with black pixels.
• Random horizontal flip. We flip the cropped image with probability 0.5.
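The random-sampling step above can be sketched as follows. The eight scale ranges listed here are partly assumptions: only 10-15, 15-20 and 20-40 are stated explicitly in the text, and the remaining bounds up to the 560-pixel limit are illustrative. The helper returns just the resize factor; the subsequent 640×640 crop is omitted.

```python
import random

# Continuous face scales of the eight loss branches (partly assumed; see above).
BRANCH_SCALES = [(10, 15), (15, 20), (20, 40), (40, 70),
                 (70, 110), (110, 250), (250, 400), (400, 560)]

def sampling_resize_factor(face_box):
    """Pick a branch uniformly at random and return the factor by which the
    whole image (and all bboxes) must be resized so that the selected face's
    longer side lands inside that branch's continuous scale range."""
    x1, y1, x2, y2 = face_box
    longer_side = max(x2 - x1, y2 - y1)
    lo, hi = random.choice(BRANCH_SCALES)
    target = random.uniform(lo, hi)  # target size of the longer side in pixels
    return target / longer_side

random.seed(0)
f = sampling_resize_factor((0, 0, 50, 100))
print(f)  # the resized longer side 100 * f lies in [10, 560]
```

Choosing the branch first (rather than the resize factor) is what makes every branch see roughly the same number of faces during training.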
Loss function.
In each loss branch, there are two sub-branches for face classification and bbox regression. For face classification, we use softmax with cross-entropy loss over two classes. The matched RF anchors are positive and the others are negative. RF anchors with more than one matched face are ignored. Besides, a gray scale is set for each continuous scale. Let {SL_i} be the lower bounds of the continuous scales and {SU_i} the upper bounds. The lower and upper gray bounds are calculated as {⌊SL_i × 0.9⌋} and {⌈SU_i × 1.1⌉}. For each continuous scale i, the relevant gray scales are [⌊SL_i × 0.9⌋, SL_i] and [SU_i, ⌈SU_i × 1.1⌉]. For example, branch 3 is for the face scale 20-40, so the corresponding gray scales are [18, 20] and [40, 44]. Faces that fall in the gray scales are ignored by the corresponding branch.

For bbox regression, we adopt the L2 loss directly. The regression groundtruth is defined as:

( (RF_x − b^tl_x) / (RF_s / 2),  (RF_y − b^tl_y) / (RF_s / 2),
  (RF_x − b^br_x) / (RF_s / 2),  (RF_y − b^br_y) / (RF_s / 2) ),      (1)

where RF_x and RF_y are the center coordinates of the RF, b^tl_x and b^tl_y are the coordinates of the top-left corner of the bbox, b^br_x and b^br_y are the coordinates of the bottom-right corner of the bbox, and the normalization constant is RF_s / 2, with RF_s being the RF size. The L2 loss is only activated for positive RF anchors that are not ignored. In the final loss function, the two losses have the same weight.

Hard negative mining. For each branch, negative RF anchors usually far outnumber positive ones. For stable and better training, only a fraction of the negative RF anchors are used for back-propagation: we sort the loss values of all negative anchors and only select the top ones for learning. The ratio between positive and negative anchors is at most 1:10. Empirically, hard negative mining brings faster and more stable convergence.
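The regression target of Eq. (1) and its inverse at inference time can be written directly. The helper names below are ours, not the paper's; the round trip shows that decoding the encoded offsets recovers the original box.

```python
def encode_bbox(rf_x, rf_y, rf_size, box):
    """Encode a groundtruth box (x1, y1, x2, y2) against an RF center,
    normalizing every offset by half the RF size, as in Eq. (1)."""
    x1, y1, x2, y2 = box
    n = rf_size / 2.0
    return ((rf_x - x1) / n, (rf_y - y1) / n,
            (rf_x - x2) / n, (rf_y - y2) / n)

def decode_bbox(rf_x, rf_y, rf_size, t):
    """Invert the encoding: recover box corners from predicted offsets."""
    n = rf_size / 2.0
    return (rf_x - t[0] * n, rf_y - t[1] * n,
            rf_x - t[2] * n, rf_y - t[3] * n)

# Round trip: encoding then decoding reproduces the box.
t = encode_bbox(100.0, 100.0, 55, (90.0, 92.0, 120.0, 130.0))
print(decode_bbox(100.0, 100.0, 55, t))
```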
Training parameters.
We initialize all parameters with the Xavier method and train the network from scratch. Each input first has 127.5 subtracted and is then divided by 127.5. The optimization method is SGD with 0.9 momentum, zero weight decay and batch size 32. The reason for zero weight decay is that the number of parameters in the proposed network is much smaller than that of VGG16; thus, there is no need for such a penalty. The initial learning rate is 0.1. We train for 1,500,000 iterations and reduce the learning rate by multiplying by 0.1 at iterations 600,000, 1,000,000, 1,200,000 and 1,400,000. Training takes about 5 days on two NVIDIA GTX 1080Ti GPUs. Our method is implemented using MXNet [2] and the source code is released¶.
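The piecewise-constant schedule above amounts to the following sketch (a restatement of the stated milestones, not the released training script):

```python
def learning_rate(iteration, base_lr=0.1,
                  milestones=(600_000, 1_000_000, 1_200_000, 1_400_000)):
    """Return the SGD learning rate at a given iteration: the base rate is
    multiplied by 0.1 at each milestone that has already been passed."""
    lr = base_lr
    for m in milestones:
        if iteration >= m:
            lr *= 0.1
    return lr

print(learning_rate(0))          # base rate 0.1
print(learning_rate(1_450_000))  # after all four drops
```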
4. Experiments
In this section, comprehensive and extensive experiments are conducted. Firstly, a new evaluation schema is proposed and the evaluation results on the benchmarks are presented. Secondly, we analyse the running efficiency on multiple platforms. Thirdly, we further investigate the amount of computation and the storage memory cost, introducing a computation efficiency rate.
In this subsection, a new evaluation schema is described first. The new schema is named Single Inference on the Original (SIO). SIO is proposed to reform the evaluation procedure for real-world applications. We notice that in some practical scenarios, latency has the same importance as accuracy. The conventional evaluation procedure involves some tricky means, such as flips and image pyramids, to achieve higher accuracy. However, the resulting time consumption is not acceptable. To this end, SIO operates in the following simple way: 1) keep the image at its original size as the network input; 2) the network does only one inference on the original image. The outputs of SIO are fed to the subsequent metrics.

In the experiments, we have to reproduce the results according to the SIO schema. Therefore, we collect the compared methods which have released codes and models. Finally, the following methods are taken for comparison: DSFD [16] (Resnet152 backbone), PyramidBox [28] (VGG16 backbone), S3FD [38] (VGG16 backbone), SSH [20] (VGG16 backbone) and FaceBoxes [37]. DSFD and PyramidBox are state-of-the-art methods. The proposed method is named LFFD. LFFD and FaceBoxes do not rely on existing pre-trained backbones and are trained from scratch. We evaluate all methods on two benchmarks: FDDB [12] and WIDER FACE [33].

¶ https://github.com/YonghaoHe/A-Light-and-Fast-Face-Detector-for-Edge-Devices

Figure 3. Evaluation results on FDDB: (a) discontinuous ROC curves; (b) continuous ROC curves. Many other published methods are not displayed here for clarity.

Table 3. Performance results on the validation set of WIDER FACE. The values in () are results from the original papers.
Table 4. Performance results on the testing set of WIDER FACE. The values in () are results from the original papers.
FDDB dataset.
FDDB contains 2,845 images with 5,171 unconstrained faces. There are two types of scoring: discrete score and continuous score. The first scoring criterion is obtained by thresholding IOU, and the second criterion directly uses the IOU ratios. We show the final evaluation results of LFFD on FDDB against the above five methods in Fig. 3. The overall performance for both scoring types shows similar trends. DSFD, PyramidBox, S3FD and SSH achieve high accuracy with marginal gaps. The proposed LFFD attains slightly lower accuracy than these four methods, but evidently outperforms FaceBoxes. The results indicate that LFFD is well suited to detecting unconstrained faces.
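Both FDDB criteria reduce to the IOU between a detection and its matched groundtruth region. A minimal sketch with axis-aligned boxes follows; the 0.5 threshold for the discrete score is the commonly used value and is assumed here.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def discrete_score(det, gt, thresh=0.5):
    """Discrete criterion: 1 if the match clears the IOU threshold, else 0."""
    return 1.0 if iou(det, gt) >= thresh else 0.0

def continuous_score(det, gt):
    """Continuous criterion: the IOU ratio itself."""
    return iou(det, gt)

print(iou((0, 0, 10, 10), (5, 5, 15, 15)))  # 25 / 175
```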
WIDER FACE dataset.
In WIDER FACE, there are 32,203 images and 393,703 labelled faces. These faces have a high degree of variability in scale, pose and occlusion. To date, WIDER FACE is the most widely used benchmark for face detection. All images are randomly divided into three subsets: training set (40%), validation set (10%) and testing set (50%). Furthermore, the images in each subset are graded into three levels (Easy, Medium and Hard) according to detection difficulty. Roughly speaking, a large number of tiny/small faces are in the Medium and Hard parts. The groundtruth annotations are available only for the training and validation sets. All the compared methods are trained on the training set. We report the results on the validation and testing sets in Tables 3 and 4, respectively.

Some observations can be made. Firstly, the performance drop is evident for DSFD, PyramidBox, S3FD and SSH compared to their original results. On the one hand, achieving high accuracy through only one inference is relatively difficult; on the other hand, the tricks can indeed improve the accuracy impressively. Secondly, PyramidBox obtains the best results on the Hard parts, whereas the performance of SSH on the Hard parts decreases dramatically, mainly due to the neglect of some tiny faces. Thirdly, FaceBoxes does not get desirable results on the Medium and Hard parts, since FaceBoxes rapidly reaches a large stride of 32, which means that faces smaller than 32 pixels are hardly detected. To make this clearer, we conduct additional experiments for FaceBoxes, named FaceBoxes3.2×, in which both sides of the input images are enlarged 3.2×. We can see that the results on the Medium and Hard parts are improved remarkably. The performance drop on the Easy parts is attributed to some faces being resized too large to be detected. To some extent, the results of FaceBoxes and FaceBoxes3.2× indicate that FaceBoxes cannot cover faces over a large range.
Fourthly, the proposed method LFFD consistently outperforms FaceBoxes, although gaps remain with the state-of-the-art methods. Additionally, LFFD is better on the Hard parts than SSH, which uses VGG16 as the backbone.

In this subsection, we analyse the running speed of all methods on three different platforms. The information about each platform and the related libraries is listed in Table 5. We use batch size 1 and a few common resolutions for testing. For fair comparison, FaceBoxes3.2× is used here instead of FaceBoxes. The running speed is measured in ms and the corresponding FPS. The final results are presented in Tables 6, 7 and 8.

In Table 6, we also add VGG16 and Resnet50 for sufficient comparison. SSH and S3FD are based on VGG16 and have similar speed to VGG16. PyramidBox is much slower due to additional complex modules, although it is based on VGG16 as well. DSFD achieves state-of-the-art accuracy, but it has the slowest running speed. The proposed LFFD runs the fastest at 3840×2160, and FaceBoxes3.2× obtains the highest speed at the other three resolutions. Both LFFD and FaceBoxes3.2× can reach or even exceed real-time running speed (> 30 FPS) at the first three resolutions. The aforementioned trend that state-of-the-art methods pursue higher accuracy at the cost of running speed is clearly verified.

TX2 and Raspberry Pi 3 are edge devices with low computation power. DSFD, PyramidBox, S3FD and SSH are either too slow or fail to run on these two platforms. Thus, we only evaluate the proposed LFFD and FaceBoxes3.2× at lower resolutions in Tables 7 and 8. The overall results show that LFFD is faster than FaceBoxes3.2× except for the case of 640×480 on Raspberry Pi 3. LFFD benefits more from the optimizations of ncnn than FaceBoxes3.2× at the low resolutions 160×120 and 320×240.

Table 5. Information about hardware platforms and related running libraries.
Table 6. Running efficiency on TITAN Xp.
Table 7. Running efficiency on TX2.
Table 8. Running efficiency on Raspberry Pi 3 Model B+.
In this subsection, we investigate the compared methods from the perspective of parameters, computation and model size. Edge devices always have constrained storage memory, so it is necessary to consider the memory usage of face detectors. The number of parameters is highly related to the model size; however, fewer parameters do not mean less computation. Following [19], we use FLOPs to measure the computation at resolution 640×480, as listed in Table 9. LFFD and FaceBoxes3.2× have light networks which are appropriate for deployment on edge devices. To further demonstrate the efficiency of the proposed network, we define a new metric:

E_net = FLOPs / t,      (2)

where t indicates the running time. E_net reflects the computation efficiency of networks (the larger, the more efficient) and can be calculated at a certain resolution on a specific platform. We compute this metric for LFFD and FaceBoxes3.2× at 640×480 on the three platforms.
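Eq. (2) is straightforward to evaluate; the figures in the example below are placeholders, not the paper's measurements.

```python
def e_net(flops, seconds):
    """Computation efficiency of Eq. (2): FLOPs executed per second of
    inference. Larger values mean the network keeps the hardware busier."""
    return flops / seconds

# Hypothetical example: a 9-GFLOP forward pass finishing in 7.6 ms.
print(f"{e_net(9e9, 7.6e-3):.3e} FLOPs/s")
```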
5. Conclusion
This paper introduces a light and fast face detector that properly balances accuracy and latency. By deeply rethinking the RF in the context of face detection, we propose an anchor-free method to overcome the drawbacks of anchor-based methods. The proposed method regards RFs as natural "anchors" which can cover continuous face scales and reach nearly 100% hit rate. After investigating the essential relations between ERFs and face scales, we delicately design a simple but efficient network with eight detection branches. The proposed network consists of common building blocks with fewer filters, resulting in fast inference speed. Comprehensive and extensive experiments are conducted to fully analyse the proposed method. The final results demonstrate that our method can achieve superior accuracy with a small model size and efficient computation, which makes it an excellent candidate for edge devices.

Table 9. Number of parameters, FLOPs and model size. The model size may vary slightly with different libraries.
References

[1] S. C. Brubaker, J. Wu, J. Sun, M. D. Mullin, and J. M. Rehg. On the design of cascades of boosted ensembles for face detection. International Journal of Computer Vision, 77:65-86, 2008.
[2] T. Chen, M. Li, Y. Li, M. Lin, N. Wang, M. Wang, T. Xiao, B. Xu, C. Zhang, and Z. Zhang. MXNet: A flexible and efficient machine learning library for heterogeneous distributed systems. arXiv:1512.01274, 2015.
[3] C. Chi, S. Zhang, J. Xing, Z. Lei, S. Z. Li, and X. Zou. Selective refinement network for high performance face detection. arXiv:1809.02693, 2018.
[4] P. Dollár, R. Appel, S. Belongie, and P. Perona. Fast feature pyramids for object detection. IEEE Transactions on Pattern Analysis and Machine Intelligence, 36(8):1532-1545, 2014.
[5] R. Girshick. Fast R-CNN. In Proceedings of the IEEE International Conference on Computer Vision, pages 1440-1448, 2015.
[6] R. Girshick, J. Donahue, T. Darrell, and J. Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 580-587, 2014.
[7] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770-778, 2016.
[8] A. G. Howard. Some improvements on deep convolutional neural network based image classification. arXiv:1312.5402, 2013.
[9] P. Hu and D. Ramanan. Finding tiny faces. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 951-959, 2017.
[10] G. Huang, Z. Liu, L. van der Maaten, and K. Q. Weinberger. Densely connected convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4700-4708, 2017.
[11] S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv:1502.03167, 2015.
[12] V. Jain and E. Learned-Miller. FDDB: A benchmark for face detection in unconstrained settings. Technical report, University of Massachusetts, Amherst, 2010.
[13] H. Jiang and E. Learned-Miller. Face detection with the faster R-CNN. In Proceedings of the IEEE International Conference on Automatic Face & Gesture Recognition, pages 650-657, 2017.
[14] X. Jin and X. Tan. Face alignment in-the-wild: A survey. Computer Vision and Image Understanding, 162:1-22, 2017.
[15] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In Proceedings of Advances in Neural Information Processing Systems, pages 1097-1105, 2012.
[16] J. Li, Y. Wang, C. Wang, Y. Tai, J. Qian, J. Yang, C. Wang, J. Li, and F. Huang. DSFD: dual shot face detector. In
Pro-ceedings of IEEE Conference on Computer Vision and Pat-tern Recognition , 2019. 1, 2, 3, 6[17] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C.-Y.Fu, and A. C. Berg. Ssd: Single shot multibox detector. In
Proceedings of European Conference on Computer Vision ,pages 21–37, 2016. 2[18] W. Luo, Y. Li, R. Urtasun, and R. Zemel. Understandingthe effective receptive field in deep convolutional neural net-works. In
Proceedings of Advances in Neural InformationProcessing Systems , pages 4898–4906, 2016. 3[19] P. Molchanov, S. Tyree, T. Karras, T. Aila, and J. Kautz.Pruning convolutional neural networks for resource efficientinference. arXiv:1611.06440 , 2016. 8[20] M. Najibi, P. Samangouei, R. Chellappa, and L. S. Davis.Ssh: Single stage headless face detector. In
Proceedings ofIEEE International Conference on Computer Vision , pages4875–4884, 2017. 2, 3, 6[21] T. Ojala, M. Pietikinen, and T. Menp. Multiresolution gray-scale and rotation invariant texture classification with localbinary patterns.
IEEE Transactions on Pattern Analysis andMachine Intelligence , 24:971–987, 2002. 2[22] M.-T. Pham and T.-J. Cham. Fast training and selection ofhaar features using statistics in boosting-based face detec-tion. In
Proceedings of IEEE International Conference onComputer Vision , pages 1–7, 2007. 2[23] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi. Youonly look once: Unified, real-time object detection. In
Pro-ceedings of IEEE Conference on Computer Vision and Pat-tern Recognition , pages 779–788, 2016. 224] J. Redmon and A. Farhadi. Yolo9000: better, faster, stronger.In
Proceedings of IEEE Conference on Computer Vision andPattern Recognition , pages 7263–7271, 2017. 2[25] J. Redmon and A. Farhadi. Yolov3: An incremental improve-ment. arXiv:1804.02767 , 2018. 2[26] S. Ren, K. He, R. Girshick, and J. Sun. Faster r-cnn: Towardsreal-time object detection with region proposal networks. In
Proceedings of Advances in Neural Information ProcessingSystems , pages 91–99, 2015. 2[27] K. Simonyan and A. Zisserman. Very deep con-volutional networks for large-scale image recognition. arXiv:1409.1556 , 2014. 1, 2[28] X. Tang, D. K. Du, Z. He, and J. Liu. Pyramidbox: Acontext-assisted single shot face detector. In
Proceedings ofEuropean Conference on Computer Vision , pages 797–813,2018. 1, 2, 3, 6[29] P. Viola and M. J. Jones. Robust real-time face detection.
International Journal of Computer Vision , 57(2):137–154,2004. 2[30] H. Wang, Z. Li, X. Ji, and Y. Wang. Face r-cnn. arXiv:1706.01061 , 2017. 2[31] M. Wang and W. Deng. Deep face recognition: A survey. arXiv:1804.06655 , 2018. 1[32] B. Yang, J. Yan, Z. Lei, and S. Z. Li. Aggregate channelfeatures for multi-view face detection. In
Proceedings ofIEEE International Joint Conference on Biometrics , pages1–8, 2014. 2[33] S. Yang, P. Luo, C. C. Loy, and X. Tang. Wider face: A facedetection benchmark. In
Proceedings of IEEE Conferenceon Computer Vision and Pattern Recognition , pages 5525–5533, 2016. 1, 5, 7[34] K. Zhang, Z. Zhang, Z. Li, and Y. Qiao. Joint face detectionand alignment using multitask cascaded convolutional net-works.
IEEE Signal Processing Letters , 23(10):1499–1503,2016. 2[35] K. Zhang, Z. Zhang, H. Wang, Z. Li, Y. Qiao, and W. Liu.Detecting faces using inside cascaded contextual cnn. In
Pro-ceedings of IEEE International Conference on Computer Vi-sion , pages 3171–3179, 2017. 2[36] S. Zhang, R. Zhu, X. Wang, H. Shi, T. Fu, S. Wang, T. Mei,and S. Z. Li. Improved selective refinement network for facedetection. arXiv:1901.06651 , 2019. 1, 2[37] S. Zhang, X. Zhu, Z. Lei, H. Shi, X. Wang, and S. Z. Li.Faceboxes: A cpu real-time face detector with high accuracy.In
Proceedings of IEEE International Joint Conference onBiometrics , pages 1–9, 2017. 2, 3, 4, 6[38] S. Zhang, X. Zhu, Z. Lei, H. Shi, X. Wang, and S. Z. Li.S3fd: Single shot scale-invariant face detector. In
Proceed-ings of IEEE International Conference on Computer Vision ,pages 192–201, 2017. 2, 3, 4, 6[39] Y. Zhang, X. Xu, and X. Liu. Robust and high performanceface detector. arXiv:1901.02350 , 2019. 1, 2[40] C. Zhu, R. Tao, K. Luu, and M. Savvides. Seeing smallfaces from robust anchor’s perspective. In
Proceedings ofIEEE Conference on Computer Vision and Pattern Recogni-tion , pages 5127–5136, 2018. 2, 3 [41] Q. Zhu, M.-C. Yeh, K.-T. Cheng, and S. Avidan. Fast humandetection using a cascade of histograms of oriented gradi-ents. In