BinaryCoP: Binary Neural Network-based COVID-19 Face-Mask Wear and Positioning Predictor on Edge Devices
Nael Fasfous, Manoj-Rohit Vemparala, Alexander Frickenstein, Lukas Frickenstein, Walter Stechele
Technical University of Munich
I. INTRODUCTION
Convolutional neural networks (CNNs) have been applied to real-world problems since the early days of their conception [1]. In current times, the ongoing COVID-19 pandemic presents new challenges, which can be solved with the help of state-of-the-art computer vision algorithms [2], [3]. One of the simplest ways of mitigating the spread of the COVID-19 disease is wearing a face-mask, which can protect the wearer from direct exposure to the virus through the mouth and nasal passages. A correctly worn mask can also protect other people, in case the wearer is already infected with the disease. This bi-directional protection makes masks highly effective in crowded and/or indoor areas. Although face-masks have become a mandatory requirement in many public areas, it is difficult to ensure the compliance of the general public. More specifically, it is difficult to assert that the masks are worn correctly as intended, i.e. completely covering the nose, mouth and chin [4].

CNNs are the current state-of-the-art in face detection applications. Compared to classical computer vision algorithms, CNNs can provide better accuracy on problems with diverse features without having to manually extract said features [5]. This holds true only when the training dataset has a fair distribution of samples. Correctly identifying a mask on a person's face is a relatively simple task for these powerful algorithms. However, a more precise classification of the exact positioning of the mask and identifying the exposed region of the face is more challenging. To maintain equivalent classification accuracy for all face structures, skin-tones, hair types, and mask types, the algorithms must be able to generalize the relevant features over all subjects.

The deployment scenarios for the CNN should also be taken into consideration. A face-mask detector can be set at the entrance of corporate buildings, shopping areas, airport checkpoints, and speed gates.

∗ Equally contributed
These distributed settings require cheap, battery-powered, edge devices which are limited in memory and compute power. To maintain security and data privacy of the public, all processing must remain on the edge device without any communication with cloud servers. Minimizing power and resource utilization while maintaining a high classification accuracy is a design challenge which necessitates hardware-software co-design. In this context, we propose Binary-CoP (Binary COVID-mask Predictor), an efficient binary neural network (BNN) classifier for real-time classification of correct face-mask wear and positioning. The challenges of the described application are tackled through the following contributions:
• Training BNNs on synthetically generated data [6] to cover a wide demographic and generalize relevant task-related features. A high accuracy of ∼98% is achieved for a 4-class problem of mask wear and positioning.
• Deploying BNNs on a low-power, real-time embedded FPGA accelerator based on the Xilinx FINN architecture [7]. The accelerator can operate at a low power of ∼ W.
• The BNNs are analyzed through Gradient-weighted Class Activation Mapping (Grad-CAM) to improve interpretability and study the features being learned.

II. RELATED WORK
A. COVID-19 Face-Mask Wear and Positioning
Correctly worn masks play a pivotal role in mitigating the spread of the COVID-19 disease during the ongoing pandemic [8]. Members of the general public often underestimate the importance of this simple yet effective method of disease prevention and control. Researchers and data scientists in the field of computer vision have collected data to train and deploy algorithms which help in automatically regulating masks in public spaces and indoor locations [9], [10]. Although large-scale natural face datasets exist, the number of real-world masked images is limited [9]. Wang et al. [10] extended their masked-face dataset with a Simulated Masked Face Recognition Dataset (SMFRD), which is synthetically generated by applying virtual masks to existing natural face datasets. Cabani et al. [6] improved the generation of synthetically masked faces by applying a deformable mask-model onto natural face images with the help of automatically detected facial key-points. The key-points of the deformable mask-model can be matched to the key-points of the face, allowing the application of the mask in a variety of ways. This allows the dataset generation process to further generate examples of incorrectly worn masks, such as chin exposed, nose exposed, or nose and mouth exposed.
B. Binary Neural Networks
The memory footprint of neural networks and the complexity of their arithmetic operations on inference hardware can be reduced through parameter quantization. In the most extreme case, binarizing neural networks constrains their weights and activations to {−1, +1}, such that their memory footprint is theoretically reduced by 32× compared to a float-32 CNN [11]. Additionally, simple XNOR and popcount operations can be used to implement multiply-accumulate (MAC) operations on inference hardware [12]. Specialized training schemes have been proposed to mitigate the loss in information capacity introduced by the low-bitwidth representation of BNNs [11], [13], [14], [15], [12]. In some cases, the low information capacity due to binarization can have a regularization effect which improves feature generalization [13]. This is helpful in improving the classification performance on real-world data, particularly when training on synthetically generated data [16]. In [13], Courbariaux et al. introduced a scheme to train neural networks with binary weights during forward propagation while maintaining latent full-precision values during back propagation. This ensures proper gradient flow and fine adjustments through the gradients. This approach was later extended by the binarization of activations [11]. Rastegari et al. [12] proposed XNOR-Net, where both weights and activations are binarized such that the convolutions of input feature maps and weights can be approximated by a combination of XNOR operations and popcounts, followed by a multiplication with scaling factors. The introduction of scaling factors improves the information capacity of the network at the cost of more trainable parameters for each layer. This adds to the computational complexity of XNOR-Net at deployment time. For the task of face-mask detection with low scene complexity, more efficient forms of BNNs [11] can be applied.
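As a concrete illustration (not taken from the paper), the XNOR-popcount replacement for MACs can be sketched in a few lines of Python. With −1 encoded as bit 0 and +1 as bit 1, the dot product of two N-element binary vectors is 2·popcount(XNOR(x, w)) − N:

```python
def binary_dot(x_bits: int, w_bits: int, n: int) -> int:
    """Dot product of two {-1,+1} vectors packed as n-bit integers
    (bit 1 encodes +1, bit 0 encodes -1)."""
    mask = (1 << n) - 1
    xnor = ~(x_bits ^ w_bits) & mask   # 1 where the two signs agree
    popcnt = bin(xnor).count("1")      # number of agreeing positions
    return 2 * popcnt - n              # agreements minus disagreements

# Example: x = [+1, -1, +1, -1], w = [+1, +1, -1, -1] (MSB-first packing)
assert binary_dot(0b1010, 0b1100, 4) == 0   # (+1) + (-1) + (-1) + (+1)
```

On hardware, the same structure maps to an XNOR gate array feeding a popcount tree, which is what makes binary MACs so cheap compared to fixed-point multipliers.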
C. BNN Hardware Accelerators
Several accelerators have been designed to exploit the benefits of BNNs [17], [18], [7], [19], [20], [21], [22]. The Xilinx FINN [7] framework was developed to accelerate BNNs efficiently on FPGA platforms. The framework compiles high-level synthesis (HLS) code from a BNN description to create a hardware design for the network. The generated streaming architecture consists of a pipeline of individual hardware components instantiated for each layer of the BNN. In this work, we deploy Binary-CoP on FINN-based hardware architectures to achieve an efficient acceleration of the masked-face inference on embedded FPGAs. We parameterize and synthesize accelerators with different hardware requirements, geared towards individual COVID-19 mask recognition (low-power) or crowd statistics collection (high-performance).
III. METHOD
This section describes the building blocks of Binary-CoP for the classification of correct mask wear and positioning.
A. Training and Inference of Binary Neural Networks
The BNN method proposed by Courbariaux et al. [11] serves as our foundation to efficiently approximate weights and activations to single-bit precision at inference time, such that the neural network's arithmetic operations can be executed by simple logic operations. Smooth model training and convergence is ensured by relying on full-precision latent weights W during training time [23]. In detail, the activation tensor A_{l-1} ∈ R^{X_i×Y_i×C_i}, with its dimensions of X_i width, Y_i height, and C_i channels, serves as the input to the convolutional layer l ∈ [1, ..., L]. Here, A_0 and A_L represent the input image and the network's prediction, respectively. The trainable parameters of the 2D-convolutional layers are composed of the latent weight matrix W ∈ R^{K×K×C_i×C_o} required for training, with kernel dimension K, input channels C_i, and output channels C_o. As previously stated, the latent weights are mapped to {−1, +1} during the forward pass for loss calculation or deployment, resulting in the binarized b ⊂ B ∈ B^{K×K×C_i×C_o}. In the hardware implementation, −1 is expressed as a binary 0 to perform multiplications as XNOR logic operations. The sign() function in Eq. 1 is used to binarize the input feature maps and weights:

b = sign(w) = { +1 if w ≥ 0; −1 otherwise }.   (1)

The derivative of the sign() function is almost always zero, resulting in insufficient gradient flow during training and back-propagation. This necessitates gradient flow approximation using a straight-through estimator (STE) [23]. Particularly for BNNs, it is of crucial importance to adjust the input elements a_{l-1} ⊂ A_{l-1} before the approximation
[Fig. 1 overview: camera application scenario at a battery-powered device; binary activations and weights (<15k Byte); prediction classes Correctly Masked, Uncovered Chin, Uncovered Nose, Uncovered Nose & Mouth; MVTUs with PEs, SIMD lanes, sliding window unit (SWU) and pipeline buffers; each processing element (PE) performs XNOR, popcount, accumulate and threshold operations; synthesized performance up to 6400 frames per second.]
Fig. 1: Schematic representation of the Binary-CoP accelerator. A camera captures images to be classified by the neural network. The BNN accelerator is tailored for the application scenario (single gate prediction or crowd statistics collection). Binary tensors are processed in the PEs of the FPGA-based accelerator using XNOR operations. The classification of the input data is available after completion of the computations at low latency or low power.

into the binary representation h_{l-1} ⊂ H_{l-1} ∈ B^{X_i×Y_i×C_i} by means of batch normalization to zero mean and unit variance. An advantage of BNNs is that the result of the batch-norm operation is followed by sign() (see Fig. 1). Since the result after applying both functions is simply {−1, +1}, the precise calculation of the batch-norm is wasteful on embedded hardware. Based on the batch-norm statistics collected at training time, a threshold point τ is defined, wherein an activation value a_{l-1} ≥ τ results in +1, otherwise −1 [7]. This allows the implementation of the typically costly batch-norm operation as a simple magnitude comparison operation on hardware. Next, the binary convolution follows as:

H_{l-1} = sign(BatchNorm(A_{l-1}));  B_l = sign(W_l)   (2)
A_l = BinConv(H_{l-1}, B_l) = PopCnt(XNOR(H_{l-1}, B_l)),   (3)

which results in the output feature map A_l ∈ R^{X_o×Y_o×C_o}.

B. Hardware Architecture
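The batch-norm-to-threshold folding described in Sec. III-A is exactly what the hardware's threshold units implement. A minimal sketch of the equivalence (the batch-norm parameter values below are made up for illustration, and γ > 0 is assumed so the inequality direction is preserved):

```python
import numpy as np

# Hypothetical batch-norm parameters for one output channel
gamma, beta, mu, var, eps = 0.8, 0.1, 2.0, 4.0, 1e-5
sigma = np.sqrt(var + eps)

def bn_then_sign(a):
    """Reference: full batch-norm followed by sign()."""
    return np.where(gamma * (a - mu) / sigma + beta >= 0, 1, -1)

# Fold the batch-norm into a single threshold tau (valid for gamma > 0):
# gamma*(a - mu)/sigma + beta >= 0  <=>  a >= mu - beta*sigma/gamma
tau = mu - beta * sigma / gamma

def threshold(a):
    """Hardware-friendly equivalent: one comparison per activation."""
    return np.where(a >= tau, 1, -1)

acts = np.linspace(-5.0, 5.0, 101)
assert np.array_equal(bn_then_sign(acts), threshold(acts))
```

The full normalization (subtract, divide, scale, shift) thus collapses into a single stored constant τ per channel and a magnitude comparator.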
The trained BNNs are conditioned for deployment on the Xilinx FINN framework [7]. The pipelined architecture offers several advantages on embedded devices, most importantly, the reduction in on-chip to off-chip memory transfers of the BNN parameters B_l and intermediate activations A_l and H_l. This is mainly feasible due to the binary format, which results in highly compact neural networks that can fit in the on-chip memory units of embedded devices. The number of processing elements (PEs), single-instruction-multiple-data (SIMD) lanes, and other parameters can be optimized by the designer to suit the acceleration of the trained BNN. The final design is synthesized and implemented on an embedded FPGA.

For each convolutional or fully-connected layer in the BNN, a matrix-vector-threshold unit (MVTU) is instantiated, which executes the XNOR, popcount and threshold operations mentioned in Sec. III-A. Each MVTU in the pipeline can be dimensioned for the number of PEs and SIMD lanes, which have a significant impact on hardware resource utilization, latency and the effective throughput of the pipeline. Based on the compute complexity of each layer, the available hardware resources need to be distributed over the corresponding MVTUs, such that all parts of the pipeline have a matched throughput. A single under-dimensioned MVTU could throttle the entire pipeline, resulting in sub-optimal throughput. A single MVTU of the pipeline is shown in Fig. 1, and a corresponding PE is detailed.

For convolutional layers, an additional sliding-window unit (SWU) reshapes the binarized activation maps to create a single, wide input feature map memory, which can be efficiently accessed by the corresponding MVTU. Max-pool layers are implemented as boolean OR operations, since a single binary "1" value suffices to make the entire pool window output equal to 1.

C. BNN Interpretability with Grad-CAM
The output of the convolutional layers in a CNN contains localized information of the input image, without any prior bias on the location of objects and features during training. This information can be captured using Class Activation Mapping (CAM) [24] and Gradient-weighted Class Activation Mapping (Grad-CAM) [25] techniques. To apply CAM, the model must end with a global average pooling layer followed by a fully-connected layer, providing the logits of a particular input. The BNN models investigated in this work operate on a small input resolution of 32×32, and achieve a high reduction of spatial information without incorporating a global average pooling layer. For this reason, the Grad-CAM approach is better suited to obtain visual interpretations of Binary-CoP's attention and determine the important regions for its predictions of different inputs and classes.

To obtain the class-discriminative localization map, we consider the activations and gradients for the output of the conv2_2 layer, which has spatial dimensions of 5×5. We use average pooling for the corresponding gradients and reduce the channels by performing Einstein summation as specified in [25]. With this approach the base networks do not need any modifications or retraining. Due to the synthetically generated dataset used for training, we expect Binary-CoP models to generalize well against domain shifts.
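The Grad-CAM computation described above can be sketched in a few lines. This is an illustrative NumPy version with random tensors standing in for the conv2_2 activations and their gradients; the 5×5 spatial shape follows the text, while the channel count is only an assumption:

```python
import numpy as np

def grad_cam(activations: np.ndarray, gradients: np.ndarray) -> np.ndarray:
    """activations, gradients: (C, H, W) for the chosen conv layer.
    Returns an (H, W) class-discriminative localization map."""
    # Channel weights: global average pooling of the gradients
    weights = gradients.mean(axis=(1, 2))               # (C,)
    # Weighted channel sum via Einstein summation, then ReLU
    cam = np.einsum("c,chw->hw", weights, activations)  # (H, W)
    return np.maximum(cam, 0.0)

rng = np.random.default_rng(0)
acts = rng.standard_normal((128, 5, 5))   # stand-in conv2_2 activations
grads = rng.standard_normal((128, 5, 5))  # stand-in gradients w.r.t. them
cam = grad_cam(acts, grads)
assert cam.shape == (5, 5) and (cam >= 0).all()
```

The resulting 5×5 map is then upsampled and overlaid on the input image, as done for the heat maps in Sec. IV-C.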
IV. RESULTS AND DESIGN SPACE EXPLORATION
A. Experimental Setup
Binary-CoP is able to detect the presence of a mask, as well as its position and correctness. This level of classification detail is possible through the more detailed split of the MaskedFace-Net dataset [6] from 2 classes, namely Correctly Masked Face Dataset (CMFD) and Incorrectly Masked Face Dataset (IMFD), to 4 classes of CMFD, IMFD Nose, IMFD Chin, and IMFD Nose and Mouth. The dataset suffers from high imbalance in the number of samples per class. From the total 133,783 samples, roughly 5% of the samples are IMFD Chin, and another 5% are IMFD Nose and Mouth. CMFD samples make up 51% of the total dataset, while IMFD Nose makes up 39%. The dataset in its raw distribution would heavily bias the training towards the two dominant classes. To counter this, we randomly sample the larger classes CMFD and IMFD Nose to collect a comparable number of examples to the two remaining classes, IMFD Chin and IMFD Nose and Mouth. The evenly balanced dataset is then randomly augmented with a varying combination of contrast, brightness, Gaussian noise, flip and rotate operations. The final size of the balanced dataset is 110K train and validation examples and 28K test samples. The images are resized to 32×32 pixels, similar to the CIFAR-10 [26] dataset.

The BNNs are trained up to 300 epochs, unless learning saturates earlier. The full-precision (FP32) variant used for the Grad-CAM comparison is trained for 175 epochs due to early learning saturation (98.6% final test accuracy). We trained the BNN architectures shown in Tab. I according to the method described in Sec. III-A. Each convolutional (Conv) and fully connected (FC) layer is followed by batch-norm and activation layers, except for the final layer. Conv groups 1 and 2 are followed by a max-pool layer. The target System-on-Chip (SoC) platform for the experiments is the Xilinx XC7Z020 (Z7020) chip. The µ-CNV design can also be synthesized for the more constrained XC7Z010 (Z7010) chip, when XNOR operations are offloaded to the DSP blocks as described in [27]. Power and throughput measurements are taken directly on a running system. The power is measured at the power supply of the board (includes both PS and PL). The throughput reported is the classification rate when the accelerator's pipeline is full.

TABLE I: Network architectures and hardware dimensioning.
Arch. l | [C_i, C_o] (K = 3 for all Conv layers):
CNV:   Conv1_1 [3, 64], Conv1_2 [64, 64], Conv2_1 [64, 128], Conv2_2 [128, 128], Conv3_1 [128, 256], Conv3_2 [256, 256], FC1 [512], FC2 [512], FC3 [4]
n-CNV: Conv1_1 [3, 16], Conv1_2 [16, 16], Conv2_1 [16, 32], Conv2_2 [32, 32], Conv3_1 [32, 64], Conv3_2 [64, 64], FC1 [128], FC2 [128], FC3 [4]
µ-CNV: Conv1_1 [3, 16], Conv1_2 [16, 16], Conv2_1 [16, 32], Conv2_2 [32, 32], Conv3_1 [32, 64], FC1 [128], FC2 [4]

PE count:   CNV: 16, 32, 16, 16, 4, 1, 1, 1, 4 | n-CNV: 16, 16, 16, 16, 4, 1, 1, 1, 1 | µ-CNV: 4, 4, 4, 4, 1, 1, 1
SIMD lanes: CNV: 3, 32, 32, 32, 32, 32, 4, 8, 1 | n-CNV: 3, 16, 16, 32, 32, 32, 4, 8, 1 | µ-CNV: 3, 16, 16, 32, 32, 16, 1
B. Design Space Exploration
We evaluate three Binary-CoP prototypes, namely CNV, n-CNV and µ-CNV. The CNV network is based on the architecture in [7], inspired by VGG-16 [28] and BinaryNet [11]. n-CNV is a downsized version for a smaller memory footprint, and µ-CNV has fewer layers to reduce the size of the synthesized design. All designs are synthesized with a target clock frequency of 100 MHz.

In Tab. II, the hardware utilization for the Binary-CoP prototypes is provided. With µ-CNV, a significant reduction in LUTs is achieved, which makes the design synthesizable on the heavily constrained Z7010 SoC. The trade-off is a slight increase in the memory footprint of the BNN, as the shallower network has a larger spatial dimension before the fully-connected layers, increasing the total number of parameters after the last convolutional layer. The choice of PE count and SIMD lanes for the n-CNV prototype allows it to reach a maximum throughput of ∼ frames per second at a power of ∼ W, which is required mostly by the soft-core on the SoC. In this setting, a classification needs to be triggered only when a subject is attempting to pass through the entrance where Binary-CoP is deployed.
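The matched-throughput requirement from Sec. III-B can be made concrete with a small folding calculation. This is a simplified sketch of FINN-style folding, not measured numbers: per layer, cycles per frame ≈ output pixels × (C_o / PE) × (K·K·C_i / SIMD), and the slowest MVTU bounds the pipeline's frame rate. Output widths assume 3×3 convolutions without padding, which is an assumption about the topology:

```python
# n-CNV conv layers from Tab. I as (C_in, C_out, output width), assuming
# 32x32 inputs, 3x3 convs without padding, pools after Conv groups 1 and 2
layers = [(3, 16, 30), (16, 16, 28), (16, 32, 12),
          (32, 32, 10), (32, 64, 3), (64, 64, 1)]
pe   = [16, 16, 16, 16, 4, 1]   # per-layer PE counts (Tab. I)
simd = [3, 16, 16, 32, 32, 32]  # per-layer SIMD lanes (Tab. I)

def layer_cycles(ci, co, out_w, n_pe, n_simd):
    """Cycles for one frame: each output pixel needs K*K*ci MACs per output
    channel, folded over PEs (output channels) and SIMD lanes (inputs)."""
    return out_w * out_w * (co // n_pe) * (3 * 3 * ci // n_simd)

cycles = [layer_cycles(ci, co, w, p, s)
          for (ci, co, w), p, s in zip(layers, pe, simd)]
bottleneck = max(cycles)
fps_at_100mhz = 100e6 / bottleneck  # slowest MVTU bounds the frame rate
print(cycles, round(fps_at_100mhz))
```

Raising PE or SIMD only on the bottleneck layer is what "matched throughput" means in practice: extra parallelism elsewhere costs LUTs without improving the frame rate. The model above ignores SWU and fully-connected-layer overheads, so real throughput is lower.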
C. Grad-CAM Analysis
The confusion matrix in Fig. 2 shows the generalization of Binary-CoP-CNV on all classes after balancing the dataset. We further analyze the output heat maps generated by Grad-CAM to interpret the predictions of our BNNs with respect to the diverse attributes of the MaskedFace-Net dataset. In Fig. 3 - Fig. 9, columns 1 and 2 indicate the label and input image, respectively. Columns 3, 4 and 5 highlight the heat maps obtained from the Grad-CAM output of Binary-CoP-CNV, Binary-CoP-n-CNV and a full-precision version of CNV with float-32 parameters (FP32). The heat maps are overlayed on the raw input images for better visualization. All raw images chosen have been classified correctly by all the networks, for fair interpretation of feature-to-prediction correlation.

TABLE II: Hardware results of design space exploration.
Configuration | LUT   | BRAM | DSP | Acc. [%]
CNV           | 26060 | 124  | 24  | –
n-CNV         | 20425 | –    | 14  | 93.94
µ-CNV         | –     | 14   | 27  | 93.78

True \ Predicted | Correct    | Nose       | N+M        | Chin
Correct          | 7125 (98%) | 41 (1%)    | 1 (0%)     | 90 (1%)
Nose             | 26 (0%)    | 7042 (98%) | 94 (2%)    | 26 (0%)
N+M              | 4 (0%)     | 79 (1%)    | 5651 (98%) | 9 (0%)
Chin             | 107 (1%)   | 41 (1%)    | 7 (0%)     | 7363 (98%)
Fig. 2: Confusion matrix of Binary-CoP-CNV on the test set.

In Fig. 3, we analyze the Region of Interest (RoI) for the correctly masked class. Binary-CoP's learning capacity allows it to focus on key facial lineaments of the human wearing the mask, rather than the mask itself. This potentially helps in generalizing on other mask types. For the child example shown in the first row, the focus of Binary-CoP variants lies on the nose, making sure that it is fully covered by the mask. Similarly, for the adult in row 2, Binary-CoP-CNV focuses on the exposed cheekbones to assert its correct-mask prediction. This also holds for our small version of Binary-CoP, with significantly reduced learning capacity. The RoI curves finely above the mask, tracing the exposed region of the face. In the third row example, Binary-CoP-CNV falls back to focusing on the mask, whereas Binary-CoP-n-CNV continues to focus on the exposed features. Both models achieve the same prediction by focusing on different parts of the raw image. In contrast to the Binary-CoP variants, the full-precision FP32 model seems to focus on a combination of several different features. This can be attributed to its larger learning capacity.

Fig. 3: Grad-CAM results for the correctly-masked class.

In Fig. 4, we analyze the Grad-CAM output of the uncovered nose class. Binary-CoP-CNV and Binary-CoP-n-CNV focus specifically on two regions, namely the nose and the straight upper edge of the mask. These clear characteristics cannot be observed with the oversized FP32 CNN. In Fig. 5, the results show the RoI for predicting the exposed mouth and nose class. All models seem to distribute their attention onto several exposed features of the face.

Fig. 4: Grad-CAM results for the nose-exposed class.
Fig. 5: Grad-CAM results for the nose and mouth-exposed class.

Fig. 6 shows Grad-CAM results which predict the chin being exposed. The top region of the mask points upwards, similar to the correctly worn mask. Therefore, the BNNs pay less attention to this region and instead focus on the neck and chin. With the full-precision FP32 model, it is difficult to interpret the reason for the correct classification, as little to no focus is given to the chin region.

Fig. 6: Grad-CAM results for the chin-exposed class.

Beyond studying the BNNs' behavior on different class predictions, we can use the attention heat maps to understand the generalization behavior of the classifier. In Fig. 7 - Fig. 9, we test Binary-CoP's generalization over ages, hair colors and head gear, as well as complete face manipulation with double-masks, face paint and sunglasses. In Fig. 7, we see that the smaller eyes of infants and the elderly do not hinder Binary-CoP's ability to focus on the top region of the correctly worn masks. In Fig. 8, Binary-CoP-CNV shows resilience to differently colored hair and head-gear, even when having a similar light-blue color as the face-masks (rows 2 and 3). In contrast, the FP32 model's attention seems to shift towards the hair and head-gear for these cases. Finally, in Fig. 9, both Binary-CoP variants focus on relevant features of the corresponding label, irrespective of the obscured or manipulated faces. This empirically shows that the complex training of BNNs, along with their lower information capacity, constrains them to focus on a smaller set of relevant features, thereby generalizing well for unprecedented cases.

Fig. 7: Grad-CAM results for age generalization.

V. CONCLUSION
In this paper, we apply binary neural networks to the task of classifying the correctness of face-mask wear and positioning. In the context of the ongoing COVID-19 pandemic, such algorithms can be used at entrances to corporate buildings, airports, shopping areas, and other indoor locations to mitigate the spread of the virus. Applying BNNs to this application solves several challenges, such as (1) maintaining data privacy of the public by processing data on the edge device, (2) deploying the classifier on an efficient XNOR-based accelerator to achieve low-power computation, and (3) minimizing the neural network's memory footprint by representing all parameters in the binary domain, enabling deployment on low-cost, embedded hardware. The accelerator requires only modest resources (∼ ) while achieving ∼98% accuracy for four wearing positions of the MaskedFace-Net dataset. The Grad-CAM approach is used to study the features learned by the proposed Binary-CoP classifier. The results show the classifier's high generalization ability, allowing it to perform well on different face structures, skin-tones, hair types, and age groups.

Fig. 8: Grad-CAM results for hair/headgear generalization.
Fig. 9: Grad-CAM results for face manipulation with double-masks, face paint and sunglasses.

REFERENCES

[1] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, "Gradient-based learning applied to document recognition,"
Proceedings of the IEEE, vol. 86, no. 11, pp. 2278–2324, 1998.
[2] L. Wang and A. Wong, "COVID-Net: A tailored deep convolutional neural network design for detection of COVID-19 cases from chest X-ray images," 2020.
[3] A. I. Khan, J. L. Shah, and M. M. Bhat, "CoroNet: A deep neural network for detection and diagnosis of COVID-19 from chest X-ray images," Computer Methods and Programs in Biomedicine.
[5] in Advances in Computer Vision, K. Arai and S. Kapoor, Eds. Cham: Springer International Publishing, 2020, pp. 128–144.
[6] A. Cabani, K. Hammoudi, H. Benhabiles, and M. Melkemi, "MaskedFace-Net – A dataset of correctly/incorrectly masked face images in the context of COVID-19," Smart Health.
[7] Y. Umuroglu, N. J. Fraser, G. Gambardella, M. Blott, P. Leong, M. Jahre, and K. Vissers, "FINN: A framework for fast, scalable binarized neural network inference," in Proceedings of the 2017 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, ser. FPGA '17. New York, NY, USA: ACM, 2017, pp. 65–74. [Online]. Available: http://doi.acm.org/10.1145/3020078.3021744
[8] T. Mitze, R. Kosfeld, J. Rode, and K. Wälde, "Face masks considerably reduce COVID-19 cases in Germany," Proceedings of the National Academy of Sciences, 2017, pp. 426–434.
[10] Z. Wang, G. Wang, B. Huang, Z. Xiong, Q. Hong, H. Wu, P. Yi, K. Jiang, N. Wang, Y. Pei, H. Chen, Y. Miao, Z. Huang, and J. Liang, "Masked face recognition dataset and application," 2020.
[11] I. Hubara, M. Courbariaux, D. Soudry, R. El-Yaniv, and Y. Bengio, "Binarized neural networks," in Advances in Neural Information Processing Systems 29. Curran Associates, Inc., 2016, pp. 4107–4115. [Online]. Available: http://papers.nips.cc/paper/6573-binarized-neural-networks.pdf
[12] M. Rastegari, V. Ordonez, J. Redmon, and A. Farhadi, "XNOR-Net: ImageNet classification using binary convolutional neural networks," in The European Conference on Computer Vision (ECCV). Cham: Springer International Publishing, 2016, pp. 525–542.
[13] M. Courbariaux, Y. Bengio, and J.-P. David, "BinaryConnect: Training deep neural networks with binary weights during propagations," in Advances in Neural Information Processing Systems (NeurIPS), C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett, Eds. Curran Associates, Inc., 2015, pp. 3123–3131.
[14] S. Darabi, M. Belbahri, M. Courbariaux, and V. P. Nia, "BNN+: Improved binary network training," CoRR, vol. abs/1812.11800, 2018. [Online]. Available: http://arxiv.org/abs/1812.11800
[15] X. Lin, C. Zhao, and W. Pan, "Towards accurate binary convolutional neural network," in Advances in Neural Information Processing Systems 30, I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, Eds. Curran Associates, Inc., 2017, pp. 345–353. [Online]. Available: http://papers.nips.cc/paper/6638-towards-accurate-binary-convolutional-neural-network.pdf
[16] A. Frickenstein, M.-R. Vemparala, J. Mayr, N.-S. Nagaraja, C. Unger, F. Tombari, and W. Stechele, "Binary DAD-Net: Binarized drivable area detection network for autonomous driving," in International Conference on Robotics and Automation (ICRA), Paris, France, 2020.
[17] R. Andri, L. Cavigelli, D. Rossi, and L. Benini, "YodaNN: An architecture for ultralow power binary-weight CNN acceleration," IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 37, no. 1, pp. 48–60, Jan. 2018.
[18] K. Ando, K. Ueyoshi, K. Orimo, H. Yonekawa, S. Sato, H. Nakahara, S. Takamaeda-Yamazaki, M. Ikebe, T. Asai, T. Kuroda, and M. Motomura, "BRein memory: A single-chip binary/ternary reconfigurable in-memory deep neural network accelerator achieving 1.4 TOPS at 0.6 W," IEEE Journal of Solid-State Circuits, vol. 53, no. 4, pp. 983–994, Apr. 2018.
[19] S. Liang, S. Yin, L. Liu, W. Luk, and S. Wei, "FP-BNN," Neurocomputing, vol. 275, pp. 1072–1086, Jan. 2018. [Online]. Available: https://doi.org/10.1016/j.neucom.2017.09.046
[20] P. Guo, H. Ma, R. Chen, P. Li, S. Xie, and D. Wang, "FBNA: A fully binarized neural network accelerator," 2018, pp. 51–513.
[21] E. Nurvitadhi, D. Sheffield, J. Sim, A. Mishra, G. Venkatesh, and D. Marr, "Accelerating binarized neural networks: Comparison of FPGA, CPU, GPU, and ASIC," 2016, pp. 77–84.
[22] C. Fu, S. Zhu, H. Su, C.-E. Lee, and J. Zhao, "Towards fast and energy-efficient binarized neural network inference on FPGA," in Proceedings of the 2019 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, ser. FPGA '19. New York, NY, USA: Association for Computing Machinery, 2019, p. 306. [Online]. Available: https://doi.org/10.1145/3289602.3293990
[23] Y. Bengio, N. Léonard, and A. C. Courville, "Estimating or propagating gradients through stochastic neurons for conditional computation," CoRR, vol. abs/1308.3432, 2013. [Online]. Available: http://arxiv.org/abs/1308.3432
[24] B. Zhou, A. Khosla, A. Lapedriza, A. Oliva, and A. Torralba, "Learning deep features for discriminative localization," in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2016, pp. 2921–2929.
[25] R. R. Selvaraju, M. Cogswell, A. Das, R. Vedantam, D. Parikh, and D. Batra, "Grad-CAM: Visual explanations from deep networks via gradient-based localization," in Proceedings of the IEEE International Conference on Computer Vision (ICCV), Oct. 2017.
[26] A. Krizhevsky, "Learning multiple layers of features from tiny images," University of Toronto, 2009.
[27] N. Fasfous, M. R. Vemparala, A. Frickenstein, and W. Stechele, "OrthrusPE: Runtime reconfigurable processing elements for binary neural networks," 2020, pp. 1662–1667.
[28] K. Simonyan and A. Zisserman, "Very deep convolutional networks for large-scale image recognition."