BinaryCoP: Binary Neural Network-based COVID-19 Face-Mask Wear and Positioning Predictor on Edge Devices
Nael Fasfous, Manoj-Rohit Vemparala, Alexander Frickenstein, Lukas Frickenstein, Walter Stechele
Technical University of Munich
I. INTRODUCTION
Convolutional neural networks (CNNs) have been applied to real-world problems since the early days of their conception [1]. In current times, the ongoing COVID-19 pandemic presents new challenges, which can be solved with the help of state-of-the-art computer vision algorithms [2], [3]. One of the simplest ways of mitigating the spread of the COVID-19 disease is wearing a face-mask, which can protect the wearer from direct exposure to the virus through the mouth and nasal passages. A correctly worn mask can also protect other people, in case the wearer is already infected with the disease. This bi-directional protection makes masks highly effective in crowded and/or indoor areas. Although face-masks have become a mandatory requirement in many public areas, it is difficult to ensure the compliance of the general public. More specifically, it is difficult to assert that the masks are worn correctly as intended, i.e. completely covering the nose, mouth and chin [4].

CNNs are the current state-of-the-art in face detection applications. Compared to classical computer vision algorithms, CNNs can provide better accuracy on problems with diverse features without having to manually extract said features [5]. This holds true only when the training dataset has a fair distribution of samples. Correctly identifying a mask on a person's face is a relatively simple task for these powerful algorithms. However, a more precise classification of the exact positioning of the mask and identifying the exposed region of the face is more challenging. To maintain equivalent classification accuracy for all face structures, skin-tones, hair types, and mask types, the algorithms must be able to generalize the relevant features over all subjects.

The deployment scenarios for the CNN should also be taken into consideration. A face-mask detector can be set at the entrance of corporate buildings, shopping areas, airport checkpoints, and speed gates.

∗ Equally contributed
These distributed settings require cheap, battery-powered, edge devices which are limited in memory and compute power. To maintain security and data privacy of the public, all processing must remain on the edge device without any communication with cloud servers. Minimizing power and resource utilization while maintaining a high classification accuracy is a design challenge which necessitates hardware-software co-design. In this context, we propose Binary-CoP (Binary COVID-mask Predictor), an efficient binary neural network (BNN) classifier for real-time classification of correct face-mask wear and positioning. The challenges of the described application are tackled through the following contributions:
• Training BNNs on synthetically generated data [6] to cover a wide demographic and generalize relevant task-related features. A high accuracy of ∼98% is achieved for a 4-class problem of mask wear and positioning.
• Deploying BNNs on a low-power, real-time embedded FPGA accelerator based on the Xilinx FINN architecture [7]. The accelerator can operate at a low power of ∼ W.
• The BNNs are analyzed through Gradient-weighted Class Activation Mapping (Grad-CAM) to improve interpretability and study the features being learned.

II. RELATED WORK
A. COVID-19 Face-Mask Wear and Positioning
Correctly worn masks play a pivotal role in mitigating the spread of the COVID-19 disease during the ongoing pandemic [8]. Members of the general public often underestimate the importance of this simple yet effective method of disease prevention and control. Researchers and data scientists in the field of computer vision have collected data to train and deploy algorithms which help in automatically regulating masks in public spaces and indoor locations [9], [10]. Although large-scale natural face datasets exist, the number of real-world masked images is limited [9]. Wang et al. [10] extended their masked-face dataset with a Simulated Masked Face Recognition Dataset (SMFRD), which is synthetically generated by applying virtual masks to existing natural face datasets. Cabani et al. [6] improved the generation of synthetically masked faces by applying a deformable mask-model onto natural face images with the help of automatically detected facial key-points. The key-points of the deformable mask-model can be matched to the key-points of the face, allowing the application of the mask in a variety of ways. This allows the dataset generation process to further generate examples of incorrectly worn masks, such as chin exposed, nose exposed, or nose and mouth exposed.
B. Binary Neural Networks
The memory footprint of neural networks and the complexity of their arithmetic operations on inference hardware can be reduced through parameter quantization. In the most extreme case, binarizing neural networks constrains their weights and activations to {−1, +1}, such that their memory footprint is theoretically reduced by 32× compared to a float-32 CNN [11]. Additionally, simple XNOR and popcount operations can be used to implement multiply-accumulate (MAC) operations on inference hardware [12]. Specialized training schemes have been proposed to mitigate the loss in information capacity introduced by the low-bitwidth representation of BNNs [11], [13], [14], [15], [12]. In some cases, the low information capacity due to binarization can have a regularization effect which improves feature generalization [13]. This is helpful in improving the classification performance on real-world data, particularly when training on synthetically generated data [16]. In [13], Courbariaux et al. introduced a scheme to train neural networks with binary weights during forward propagation while maintaining latent full-precision values during back propagation. This ensures proper gradient flow and fine adjustments through the gradients. This approach was later extended by the binarization of activations [11]. Rastegari et al. [12] proposed XNOR-Net, where both weights and activations are binarized such that the convolutions of input feature maps and weights can be approximated by a combination of XNOR operations and popcounts, followed by a multiplication with scaling factors. The introduction of scaling factors improves the information capacity of the network at the cost of more trainable parameters for each layer. This adds to the computational complexity of XNOR-Net at deployment time. For the task of face-mask detection with low scene complexity, more efficient forms of BNNs [11] can be applied.
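As a concrete illustration (not taken from the paper), the XNOR-popcount replacement for MACs can be sketched in a few lines of Python. With −1 encoded as bit 0 and +1 as bit 1, the dot product of two N-element binary vectors is 2·popcount(XNOR(x, w)) − N:

```python
def binary_dot(x_bits: int, w_bits: int, n: int) -> int:
    """Dot product of two {-1,+1} vectors packed as n-bit integers
    (bit 1 encodes +1, bit 0 encodes -1)."""
    mask = (1 << n) - 1
    xnor = ~(x_bits ^ w_bits) & mask   # 1 where the two signs agree
    popcnt = bin(xnor).count("1")      # number of agreeing positions
    return 2 * popcnt - n              # agreements minus disagreements

# Example: x = [+1, -1, +1, -1], w = [+1, +1, -1, -1] (MSB-first packing)
assert binary_dot(0b1010, 0b1100, 4) == 0   # (+1) + (-1) + (-1) + (+1)
```

On hardware, the same structure maps to an XNOR gate array feeding a popcount tree, which is what makes binary MACs so cheap compared to fixed-point multipliers.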
C. BNN Hardware Accelerators
Several accelerators have been designed to exploit the benefits of BNNs [17], [18], [7], [19], [20], [21], [22]. The Xilinx FINN [7] framework was developed to accelerate BNNs efficiently on FPGA platforms. The framework compiles high-level synthesis (HLS) code from a BNN description to create a hardware design for the network. The generated streaming architecture consists of a pipeline of individual hardware components instantiated for each layer of the BNN. In this work, we deploy Binary-CoP on FINN-based hardware architectures to achieve an efficient acceleration of the masked-face inference on embedded FPGAs. We parameterize and synthesize accelerators with different hardware requirements, geared towards individual COVID-19 mask recognition (low-power) or crowd statistics collection (high-performance).
III. METHOD
This section describes the building blocks of Binary-CoP for the classification of correct mask wear and positioning.
A. Training and Inference of Binary Neural Networks
The BNN method proposed by Courbariaux et al. [11] serves as our foundation to efficiently approximate weights and activations to single-bit precision at inference time, such that the neural network's arithmetic operations can be executed by simple logic operations. Smooth model training and convergence is ensured by relying on full-precision latent weights W during training time [23]. In detail, the activation tensor A_{l-1} ∈ R^{X_i×Y_i×C_i}, with its dimensions of X_i width, Y_i height, and C_i channels, serves as the input to the convolutional layer l ∈ [1, ..., L]. Here, A_0 and A_L represent the input image and the network's prediction, respectively. The trainable parameters of the 2D-convolutional layers are composed of the latent weight matrix W ∈ R^{K×K×C_i×C_o} required for training, with kernel dimension K, input channels C_i, and output channels C_o. As previously stated, the latent weights are mapped to {−1, +1} during the forward pass for loss calculation or deployment, resulting in the binarized b ⊂ B ∈ B^{K×K×C_i×C_o}. In the hardware implementation, −1 is expressed as a binary 0 to perform multiplications as XNOR logic operations. The sign() function in Eq. 1 is used to binarize the input feature maps and weights:

b = sign(w) = { +1 if w ≥ 0; −1 otherwise }.   (1)

The derivative of the sign() function is almost always zero, resulting in insufficient gradient flow during training and back-propagation. This necessitates gradient flow approximation using a straight-through estimator (STE) [23]. Particularly for BNNs, it is of crucial importance to adjust the input elements a_{l-1} ⊂ A_{l-1} before the approximation
[Fig. 1 overview: camera application scenario at a battery-powered device; binary activations and weights (<15k Byte); prediction classes Correctly Masked, Uncovered Chin, Uncovered Nose, Uncovered Nose & Mouth; MVTUs with PEs, SIMD lanes, sliding window unit (SWU) and pipeline buffers; each processing element (PE) performs XNOR, popcount, accumulate and threshold operations; synthesized performance up to 6400 frames per second.]
Fig. 1: Schematic representation of the Binary-CoP accelerator. A camera captures images to be classified by the neural network. The BNN accelerator is tailored for the application scenario (single gate prediction or crowd statistics collection). Binary tensors are processed in the PEs of the FPGA-based accelerator using XNOR operations. The classification of the input data is available after completion of the computations at low latency or low power.

into the binary representation h_{l-1} ⊂ H_{l-1} ∈ B^{X_i×Y_i×C_i} by means of batch normalization to zero mean and unit variance. An advantage of BNNs is that the result of the batch-norm operation is followed by sign() (see Fig. 1). Since the result after applying both functions is simply {−1, +1}, the precise calculation of the batch-norm is wasteful on embedded hardware. Based on the batch-norm statistics collected at training time, a threshold point τ is defined, wherein an activation value a_{l-1} ≥ τ results in +1, otherwise −1 [7]. This allows the implementation of the typically costly batch-norm operation as a simple magnitude comparison operation on hardware. Next, the binary convolution follows as:

H_{l-1} = sign(BatchNorm(A_{l-1}));  B_l = sign(W_l)   (2)
A_l = BinConv(H_{l-1}, B_l) = PopCnt(XNOR(H_{l-1}, B_l)),   (3)

which results in the output feature map A_l ∈ R^{X_o×Y_o×C_o}.

B. Hardware Architecture
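The batch-norm-to-threshold folding described in Sec. III-A is exactly what the hardware's threshold units implement. A minimal sketch of the equivalence (the batch-norm parameter values below are made up for illustration, and γ > 0 is assumed so the inequality direction is preserved):

```python
import numpy as np

# Hypothetical batch-norm parameters for one output channel
gamma, beta, mu, var, eps = 0.8, 0.1, 2.0, 4.0, 1e-5
sigma = np.sqrt(var + eps)

def bn_then_sign(a):
    """Reference: full batch-norm followed by sign()."""
    return np.where(gamma * (a - mu) / sigma + beta >= 0, 1, -1)

# Fold the batch-norm into a single threshold tau (valid for gamma > 0):
# gamma*(a - mu)/sigma + beta >= 0  <=>  a >= mu - beta*sigma/gamma
tau = mu - beta * sigma / gamma

def threshold(a):
    """Hardware-friendly equivalent: one comparison per activation."""
    return np.where(a >= tau, 1, -1)

acts = np.linspace(-5.0, 5.0, 101)
assert np.array_equal(bn_then_sign(acts), threshold(acts))
```

The full normalization (subtract, divide, scale, shift) thus collapses into a single stored constant τ per channel and a magnitude comparator.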
The trained BNNs are conditioned for deployment on the Xilinx FINN framework [7]. The pipelined architecture offers several advantages on embedded devices, most importantly, the reduction in on-chip to off-chip memory transfers of the BNN parameters B_l and intermediate activations A_l and H_l. This is mainly feasible due to the binary format, which results in highly compact neural networks that can fit in the on-chip memory units of embedded devices. The number of processing elements (PEs), single-instruction-multiple-data (SIMD) lanes, and other parameters can be optimized by the designer to suit the acceleration of the trained BNN. The final design is synthesized and implemented on an embedded FPGA.

For each convolutional or fully-connected layer in the BNN, a matrix-vector-threshold unit (MVTU) is instantiated, which executes the XNOR, popcount and threshold operations mentioned in Sec. III-A. Each MVTU in the pipeline can be dimensioned for the number of PEs and SIMD lanes, which have a significant impact on hardware resource utilization, latency and the effective throughput of the pipeline. Based on the compute complexity of each layer, the available hardware resources need to be distributed over the corresponding MVTUs, such that all parts of the pipeline have a matched throughput. A single under-dimensioned MVTU could throttle the entire pipeline, resulting in sub-optimal throughput. A single MVTU of the pipeline is shown in Fig. 1, and a corresponding PE is detailed.

For convolutional layers, an additional sliding-window unit (SWU) reshapes the binarized activation maps to create a single, wide input feature map memory, which can be efficiently accessed by the corresponding MVTU. Max-pool layers are implemented as boolean OR operations, since a single binary "1" value suffices to make the entire pool window output equal to 1.

C. BNN Interpretability with Grad-CAM
The output of the convolutional layers in a CNN contains localized information of the input image, without any prior bias on the location of objects and features during training. This information can be captured using Class Activation Mapping (CAM) [24] and Gradient-weighted Class Activation Mapping (Grad-CAM) [25] techniques. To apply CAM, the model must end with a global average pooling layer followed by a fully-connected layer, providing the logits of a particular input. The BNN models investigated in this work operate on a small input resolution of 32×32, and achieve a high reduction of spatial information without incorporating a global average pooling layer. For this reason, the Grad-CAM approach is better suited to obtain visual interpretations of Binary-CoP's attention and determine the important regions for its predictions of different inputs and classes.

To obtain the class-discriminative localization map, we consider the activations and gradients for the output of the conv2_2 layer, which has spatial dimensions of 5×5. We use average pooling for the corresponding gradients and reduce the channels by performing Einstein summation as specified in [25]. With this approach the base networks do not need any modifications or retraining. Due to the synthetically generated dataset used for training, we expect Binary-CoP models to generalize well against domain shifts.
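The Grad-CAM computation described above can be sketched in a few lines. This is an illustrative NumPy version with random tensors standing in for the conv2_2 activations and their gradients; the 5×5 spatial shape follows the text, while the channel count is only an assumption:

```python
import numpy as np

def grad_cam(activations: np.ndarray, gradients: np.ndarray) -> np.ndarray:
    """activations, gradients: (C, H, W) for the chosen conv layer.
    Returns an (H, W) class-discriminative localization map."""
    # Channel weights: global average pooling of the gradients
    weights = gradients.mean(axis=(1, 2))               # (C,)
    # Weighted channel sum via Einstein summation, then ReLU
    cam = np.einsum("c,chw->hw", weights, activations)  # (H, W)
    return np.maximum(cam, 0.0)

rng = np.random.default_rng(0)
acts = rng.standard_normal((128, 5, 5))   # stand-in conv2_2 activations
grads = rng.standard_normal((128, 5, 5))  # stand-in gradients w.r.t. them
cam = grad_cam(acts, grads)
assert cam.shape == (5, 5) and (cam >= 0).all()
```

The resulting 5×5 map is then upsampled and overlaid on the input image, as done for the heat maps in Sec. IV-C.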
IV. RESULTS AND DESIGN SPACE EXPLORATION
A. Experimental Setup
Binary-CoP is able to detect the presence of a mask, as well as its position and correctness. This level of classification detail is possible through the more detailed split of the MaskedFace-Net dataset [6] from 2 classes, namely Correctly Masked Face Dataset (CMFD) and Incorrectly Masked Face Dataset (IMFD), to 4 classes of CMFD, IMFD Nose, IMFD Chin, and IMFD Nose and Mouth. The dataset suffers from high imbalance in the number of samples per class. From the total 133,783 samples, roughly 5% of the samples are IMFD Chin, and another 5% are IMFD Nose and Mouth. CMFD samples make up 51% of the total dataset, while IMFD Nose makes up 39%. The dataset in its raw distribution would heavily bias the training towards the two dominant classes. To counter this, we randomly sample the larger classes CMFD and IMFD Nose to collect a comparable number of examples to the two remaining classes, IMFD Chin and IMFD Nose and Mouth. The evenly balanced dataset is then randomly augmented with a varying combination of contrast, brightness, Gaussian noise, flip and rotate operations. The final size of the balanced dataset is 110K train and validation examples and 28K test samples. The images are resized to 32×32 pixels, similar to the CIFAR-10 [26] dataset.

The BNNs are trained up to 300 epochs, unless learning saturates earlier. The full-precision (FP32) variant used for the Grad-CAM comparison is trained for 175 epochs due to early learning saturation (98.6% final test accuracy). We trained the BNN architectures shown in Tab. I according to the method described in Sec. III-A. Each convolutional (Conv) and fully connected (FC) layer is followed by batch-norm and activation layers, except for the final layer. Conv groups 1 and 2 are followed by a max-pool layer. The target System-on-Chip (SoC) platform for the experiments is the Xilinx XC7Z020 (Z7020) chip. The µ-CNV design can also be synthesized for the more constrained XC7Z010 (Z7010) chip, when XNOR operations are offloaded to the DSP blocks as described in [27]. Power and throughput measurements are taken directly on a running system. The power is measured at the power supply of the board (includes both PS and PL). The throughput reported is the classification rate when the accelerator's pipeline is full.

TABLE I: Network architectures and hardware dimensioning.
Arch. l | [C_i, C_o] (K = 3 for all Conv layers):
CNV:   Conv1_1 [3, 64], Conv1_2 [64, 64], Conv2_1 [64, 128], Conv2_2 [128, 128], Conv3_1 [128, 256], Conv3_2 [256, 256], FC1 [512], FC2 [512], FC3 [4]
n-CNV: Conv1_1 [3, 16], Conv1_2 [16, 16], Conv2_1 [16, 32], Conv2_2 [32, 32], Conv3_1 [32, 64], Conv3_2 [64, 64], FC1 [128], FC2 [128], FC3 [4]
µ-CNV: Conv1_1 [3, 16], Conv1_2 [16, 16], Conv2_1 [16, 32], Conv2_2 [32, 32], Conv3_1 [32, 64], FC1 [128], FC2 [4]

PE count:   CNV: 16, 32, 16, 16, 4, 1, 1, 1, 4 | n-CNV: 16, 16, 16, 16, 4, 1, 1, 1, 1 | µ-CNV: 4, 4, 4, 4, 1, 1, 1
SIMD lanes: CNV: 3, 32, 32, 32, 32, 32, 4, 8, 1 | n-CNV: 3, 16, 16, 32, 32, 32, 4, 8, 1 | µ-CNV: 3, 16, 16, 32, 32, 16, 1
B. Design Space Exploration
We evaluate three Binary-CoP prototypes, namely CNV, n-CNV and µ-CNV. The CNV network is based on the architecture in [7], inspired by VGG-16 [28] and BinaryNet [11]. n-CNV is a downsized version for a smaller memory footprint, and µ-CNV has fewer layers to reduce the size of the synthesized design. All designs are synthesized with a target clock frequency of 100 MHz.

In Tab. II, the hardware utilization for the Binary-CoP prototypes is provided. With µ-CNV, a significant reduction in LUTs is achieved, which makes the design synthesizable on the heavily constrained Z7010 SoC. The trade-off is a slight increase in the memory footprint of the BNN, as the shallower network has a larger spatial dimension before the fully-connected layers, increasing the total number of parameters after the last convolutional layer. The choice of PE count and SIMD lanes for the n-CNV prototype allows it to reach a maximum throughput of ∼ frames per second at a power of ∼ W, which is required mostly by the soft-core on the SoC. In this setting, a classification needs to be triggered only when a subject is attempting to pass through the entrance where Binary-CoP is deployed.
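The matched-throughput requirement from Sec. III-B can be made concrete with a small folding calculation. This is a simplified sketch of FINN-style folding, not measured numbers: per layer, cycles per frame ≈ output pixels × (C_o / PE) × (K·K·C_i / SIMD), and the slowest MVTU bounds the pipeline's frame rate. Output widths assume 3×3 convolutions without padding, which is an assumption about the topology:

```python
# n-CNV conv layers from Tab. I as (C_in, C_out, output width), assuming
# 32x32 inputs, 3x3 convs without padding, pools after Conv groups 1 and 2
layers = [(3, 16, 30), (16, 16, 28), (16, 32, 12),
          (32, 32, 10), (32, 64, 3), (64, 64, 1)]
pe   = [16, 16, 16, 16, 4, 1]   # per-layer PE counts (Tab. I)
simd = [3, 16, 16, 32, 32, 32]  # per-layer SIMD lanes (Tab. I)

def layer_cycles(ci, co, out_w, n_pe, n_simd):
    """Cycles for one frame: each output pixel needs K*K*ci MACs per output
    channel, folded over PEs (output channels) and SIMD lanes (inputs)."""
    return out_w * out_w * (co // n_pe) * (3 * 3 * ci // n_simd)

cycles = [layer_cycles(ci, co, w, p, s)
          for (ci, co, w), p, s in zip(layers, pe, simd)]
bottleneck = max(cycles)
fps_at_100mhz = 100e6 / bottleneck  # slowest MVTU bounds the frame rate
print(cycles, round(fps_at_100mhz))
```

Raising PE or SIMD only on the bottleneck layer is what "matched throughput" means in practice: extra parallelism elsewhere costs LUTs without improving the frame rate. The model above ignores SWU and fully-connected-layer overheads, so real throughput is lower.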
C. Grad-CAM Analysis
The confusion matrix in Fig. 2 shows the generalization of Binary-CoP-CNV on all classes after balancing the dataset. We further analyze the output heat maps generated by Grad-CAM to interpret the predictions of our BNNs with respect to the diverse attributes of the MaskedFace-Net dataset. In Fig. 3 - Fig. 9, columns 1 and 2 indicate the label and input image, respectively. Columns 3, 4 and 5 highlight the heat maps obtained from the Grad-CAM output of Binary-CoP-CNV, Binary-CoP-n-CNV and a full-precision version of CNV with float-32 parameters (FP32). The heat maps are overlayed on the raw input images for better visualization. All raw images chosen have been classified correctly by all the networks, for fair interpretation of feature-to-prediction correlation.

TABLE II: Hardware results of design space exploration.
Configuration | LUT   | BRAM | DSP | Acc. [%]
CNV           | 26060 | 124  | 24  | –
n-CNV         | 20425 | –    | 14  | 93.94
µ-CNV         | –     | 14   | 27  | 93.78

True \ Predicted | Correct    | Nose       | N+M        | Chin
Correct          | 7125 (98%) | 41 (1%)    | 1 (0%)     | 90 (1%)
Nose             | 26 (0%)    | 7042 (98%) | 94 (2%)    | 26 (0%)
N+M              | 4 (0%)     | 79 (1%)    | 5651 (98%) | 9 (0%)
Chin             | 107 (1%)   | 41 (1%)    | 7 (0%)     | 7363 (98%)
Fig. 2: Confusion matrix of Binary-CoP-CNV on the test set.

In Fig. 3, we analyze the Region of Interest (RoI) for the correctly masked class. Binary-CoP's learning capacity allows it to focus on key facial lineaments of the human wearing the mask, rather than the mask itself. This potentially helps in generalizing on other mask types. For the child example shown in the first row, the focus of Binary-CoP variants lies on the nose, making sure that it is fully covered by the mask. Similarly, for the adult in row 2, Binary-CoP-CNV focuses on the exposed cheekbones to assert its correct-mask prediction. This also holds for our small version of Binary-CoP, with significantly reduced learning capacity. The RoI curves finely above the mask, tracing the exposed region of the face. In the third row example, Binary-CoP-CNV falls back to focusing on the mask, whereas Binary-CoP-n-CNV continues to focus on the exposed features. Both models achieve the same prediction by focusing on different parts of the raw image. In contrast to the Binary-CoP variants, the full-precision FP32 model seems to focus on a combination of several different features. This can be attributed to its larger learning capacity.

Fig. 3: Grad-CAM results for the correctly-masked class.

In Fig. 4, we analyze the Grad-CAM output of the uncovered nose class. Binary-CoP-CNV and Binary-CoP-n-CNV focus specifically on two regions, namely the nose and the straight upper edge of the mask. These clear characteristics cannot be observed with the oversized FP32 CNN. In Fig. 5, the results show the RoI for predicting the exposed mouth and nose class. All models seem to distribute their attention onto several exposed features of the face.

Fig. 4: Grad-CAM results for the nose-exposed class.
Fig. 5: Grad-CAM results for the nose and mouth-exposed class.

Fig. 6 shows Grad-CAM results which predict the chin being exposed. The top region of the mask points upwards, similar to the correctly worn mask. Therefore, the BNNs pay less attention to this region and instead focus on the neck and chin. With the full-precision FP32 model, it is difficult to interpret the reason for the correct classification, as little to no focus is given to the chin region.

Fig. 6: Grad-CAM results for the chin-exposed class.

Beyond studying the BNNs' behavior on different class predictions, we can use the attention heat maps to understand the generalization behavior of the classifier. In Fig. 7 - Fig. 9, we test Binary-CoP's generalization over ages, hair colors and head gear, as well as complete face manipulation with double-masks, face paint and sunglasses. In Fig. 7, we see that the smaller eyes of infants and the elderly do not hinder Binary-CoP's ability to focus on the top region of the correctly worn masks. In Fig. 8, Binary-CoP-CNV shows resilience to differently colored hair and head-gear, even when having a similar light-blue color as the face-masks (rows 2 and 3). In contrast, the FP32 model's attention seems to shift towards the hair and head-gear for these cases. Finally, in Fig. 9, both Binary-CoP variants focus on relevant features of the corresponding label, irrespective of the obscured or manipulated faces. This empirically shows that the complex training of BNNs, along with their lower information capacity, constrains them to focus on a smaller set of relevant features, thereby generalizing well for unprecedented cases.

Fig. 7: Grad-CAM results for age generalization.

V. CONCLUSION
In this paper, we apply binary neural networks to the task of classifying the correctness of face-mask wear and positioning. In the context of the ongoing COVID-19 pandemic, such algorithms can be used at entrances to corporate buildings, airports, shopping areas, and other indoor locations to mitigate the spread of the virus. Applying BNNs to this application solves several challenges, such as (1) maintaining data privacy of the public by processing data on the edge device, (2) deploying the classifier on an efficient XNOR-based accelerator to achieve low-power computation, and (3) minimizing the neural network's memory footprint by representing all parameters in the binary domain, enabling deployment on low-cost, embedded hardware. The accelerator requires only modest resources (∼ ) while achieving ∼98% accuracy for four wearing positions of the MaskedFace-Net dataset. The Grad-CAM approach is used to study the features learned by the proposed Binary-CoP classifier. The results show the classifier's high generalization ability, allowing it to perform well on different face structures, skin-tones, hair types, and age groups.

Fig. 8: Grad-CAM results for hair/headgear generalization.
Fig. 9: Grad-CAM results for face manipulation with double-masks, face paint and sunglasses.

REFERENCES

[1] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, "Gradient-based learning applied to document recognition,"
Proceedings of the IEEE, vol. 86, no. 11, pp. 2278–2324, 1998.
[2] L. Wang and A. Wong, "COVID-Net: A tailored deep convolutional neural network design for detection of COVID-19 cases from chest X-ray images," 2020.
[3] A. I. Khan, J. L. Shah, and M. M. Bhat, "CoroNet: A deep neural network for detection and diagnosis of COVID-19 from chest X-ray images," Computer Methods and Programs in Biomedicine.
[5] in Advances in Computer Vision, K. Arai and S. Kapoor, Eds. Cham: Springer International Publishing, 2020, pp. 128–144.
[6] A. Cabani, K. Hammoudi, H. Benhabiles, and M. Melkemi, "MaskedFace-Net – A dataset of correctly/incorrectly masked face images in the context of COVID-19," Smart Health.
[7] Y. Umuroglu, N. J. Fraser, G. Gambardella, M. Blott, P. Leong, M. Jahre, and K. Vissers, "FINN: A framework for fast, scalable binarized neural network inference," in Proceedings of the 2017 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, ser. FPGA '17. New York, NY, USA: ACM, 2017, pp. 65–74. [Online]. Available: http://doi.acm.org/10.1145/3020078.3021744
[8] T. Mitze, R. Kosfeld, J. Rode, and K. Wälde, "Face masks considerably reduce COVID-19 cases in Germany," Proceedings of the National Academy of Sciences, 2017, pp. 426–434.
[10] Z. Wang, G. Wang, B. Huang, Z. Xiong, Q. Hong, H. Wu, P. Yi, K. Jiang, N. Wang, Y. Pei, H. Chen, Y. Miao, Z. Huang, and J. Liang, "Masked face recognition dataset and application," 2020.
[11] I. Hubara, M. Courbariaux, D. Soudry, R. El-Yaniv, and Y. Bengio, "Binarized neural networks," in Advances in Neural Information Processing Systems 29. Curran Associates, Inc., 2016, pp. 4107–4115. [Online]. Available: http://papers.nips.cc/paper/6573-binarized-neural-networks.pdf
[12] M. Rastegari, V. Ordonez, J. Redmon, and A. Farhadi, "XNOR-Net: ImageNet classification using binary convolutional neural networks," in The European Conference on Computer Vision (ECCV). Cham: Springer International Publishing, 2016, pp. 525–542.
[13] M. Courbariaux, Y. Bengio, and J.-P. David, "BinaryConnect: Training deep neural networks with binary weights during propagations," in Advances in Neural Information Processing Systems (NeurIPS), C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett, Eds. Curran Associates, Inc., 2015, pp. 3123–3131.
[14] S. Darabi, M. Belbahri, M. Courbariaux, and V. P. Nia, "BNN+: Improved binary network training," CoRR, vol. abs/1812.11800, 2018. [Online]. Available: http://arxiv.org/abs/1812.11800
[15] X. Lin, C. Zhao, and W. Pan, "Towards accurate binary convolutional neural network," in Advances in Neural Information Processing Systems 30, I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, Eds. Curran Associates, Inc., 2017, pp. 345–353. [Online]. Available: http://papers.nips.cc/paper/6638-towards-accurate-binary-convolutional-neural-network.pdf
[16] A. Frickenstein, M.-R. Vemparala, J. Mayr, N.-S. Nagaraja, C. Unger, F. Tombari, and W. Stechele, "Binary DAD-Net: Binarized drivable area detection network for autonomous driving," in International Conference on Robotics and Automation (ICRA), Paris, France, 2020.
[17] R. Andri, L. Cavigelli, D. Rossi, and L. Benini, "YodaNN: An architecture for ultralow power binary-weight CNN acceleration," IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 37, no. 1, pp. 48–60, Jan. 2018.
[18] K. Ando, K. Ueyoshi, K. Orimo, H. Yonekawa, S. Sato, H. Nakahara, S. Takamaeda-Yamazaki, M. Ikebe, T. Asai, T. Kuroda, and M. Motomura, "BRein memory: A single-chip binary/ternary reconfigurable in-memory deep neural network accelerator achieving 1.4 TOPS at 0.6 W," IEEE Journal of Solid-State Circuits, vol. 53, no. 4, pp. 983–994, Apr. 2018.
[19] S. Liang, S. Yin, L. Liu, W. Luk, and S. Wei, "FP-BNN," Neurocomputing, vol. 275, pp. 1072–1086, Jan. 2018. [Online]. Available: https://doi.org/10.1016/j.neucom.2017.09.046
[20] P. Guo, H. Ma, R. Chen, P. Li, S. Xie, and D. Wang, "FBNA: A fully binarized neural network accelerator," 2018, pp. 51–513.
[21] E. Nurvitadhi, D. Sheffield, J. Sim, A. Mishra, G. Venkatesh, and D. Marr, "Accelerating binarized neural networks: Comparison of FPGA, CPU, GPU, and ASIC," 2016, pp. 77–84.
[22] C. Fu, S. Zhu, H. Su, C.-E. Lee, and J. Zhao, "Towards fast and energy-efficient binarized neural network inference on FPGA," in Proceedings of the 2019 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, ser. FPGA '19. New York, NY, USA: Association for Computing Machinery, 2019, p. 306. [Online]. Available: https://doi.org/10.1145/3289602.3293990
[23] Y. Bengio, N. Léonard, and A. C. Courville, "Estimating or propagating gradients through stochastic neurons for conditional computation," CoRR, vol. abs/1308.3432, 2013. [Online]. Available: http://arxiv.org/abs/1308.3432
[24] B. Zhou, A. Khosla, A. Lapedriza, A. Oliva, and A. Torralba, "Learning deep features for discriminative localization," in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2016, pp. 2921–2929.
[25] R. R. Selvaraju, M. Cogswell, A. Das, R. Vedantam, D. Parikh, and D. Batra, "Grad-CAM: Visual explanations from deep networks via gradient-based localization," in Proceedings of the IEEE International Conference on Computer Vision (ICCV), Oct. 2017.
[26] A. Krizhevsky, "Learning multiple layers of features from tiny images," University of Toronto, 2009.
[27] N. Fasfous, M. R. Vemparala, A. Frickenstein, and W. Stechele, "OrthrusPE: Runtime reconfigurable processing elements for binary neural networks," 2020, pp. 1662–1667.
[28] K. Simonyan and A. Zisserman, "Very deep convolutional networks for large-scale image recognition."