Enabling Binary Neural Network Training on the Edge
Erwei Wang, James J. Davis, Daniele Moro, Piotr Zielinski, Claudionor Coelho, Satrajit Chatterjee, Peter Y. K. Cheung, George A. Constantinides
Abstract
The ever-growing computational demands of increasingly complex machine learning models frequently necessitate the use of powerful cloud-based infrastructure for their training. Binary neural networks are known to be promising candidates for on-device inference due to their extreme compute and memory savings over higher-precision alternatives. In this paper, we demonstrate that they are also strongly robust to gradient quantization, thereby making the training of modern models on the edge a practical reality. We introduce a low-cost binary neural network training strategy exhibiting sizable memory footprint reductions and energy savings vs Courbariaux & Bengio's standard approach. Against the latter, we see coincident memory requirement and energy consumption drops of 2–6×, while reaching similar test accuracy in comparable time, across a range of small-scale models trained to classify popular datasets. We also showcase ImageNet training of ResNetE-18, achieving a 3.12× memory reduction over the aforementioned standard. Such savings will allow for unnecessary cloud offloading to be avoided, reducing latency, increasing energy efficiency and safeguarding privacy.
1. Introduction
Although binary neural networks (BNNs) feature weights and activations with just single-bit precision, many models are able to reach accuracy indistinguishable from that of their higher-precision counterparts (Courbariaux & Bengio, 2016; Wang et al., 2019b). Since BNNs are functionally complete, their limited precision does not impose an upper bound on achievable accuracy (Constantinides, 2019). BNNs represent the ideal class of neural networks for edge inference, particularly for custom hardware implementation, due to their use of XNOR for multiplication: a fast and cheap operation to perform. Their use of compact weights also suits systems with limited memory and increases opportunities for caching, providing further potential performance boosts. FINN, the seminal BNN implementation for field-programmable gate arrays (FPGAs), reached the highest CIFAR-10 and SVHN classification rates to date at the time of its publication (Umuroglu et al., 2017).

(Affiliations: Department of Electrical Engineering, Imperial College London, London, UK; Google, San Jose, CA, USA; Palo Alto Networks, Santa Clara, CA, USA. Correspondence to: Erwei Wang <[email protected]>.)

Despite featuring binary forward propagation, existing BNN training approaches perform backward propagation using high-precision floating-point data types—typically float32—often making training infeasible on resource-constrained devices. The high-precision activations used between forward and backward propagation commonly constitute the largest proportion of the total memory footprint of a training run (Sohoni et al., 2019; Cai et al., 2020). Moreover, backward propagation with high-precision gradients is costly, challenging the energy limitations of edge platforms. An understanding of standard BNN training algorithms led us to ask two questions: why are high-precision weight gradients used when we are only concerned with weights' signs, and why are high-precision activations used when the computation of weight gradients only requires binary activations as input? In this paper, we present a low-memory, low-energy BNN training scheme based on this intuition featuring (i) the use of binary, power-of-two and 16-bit floating-point data types, and (ii) batch normalization modifications enabling the buffering of binary activations.

By increasing the viability of learning on the edge, this work will reduce the domain mismatch between training and inference—particularly in conjunction with federated learning (McMahan et al., 2017; Bonawitz et al., 2019)—and ensure privacy for sensitive applications (Agarwal et al., 2018). Via the aggressive energy and memory footprint reductions they facilitate, our proposals will enable networks to be trained without the network access reliance, latency and energy overheads or data divulgence inherent to cloud offloading. We make the following novel contributions.

• We conduct the first variable representation and lifetime analysis of the standard BNN training process, informing the application of approximations beneficial to memory and energy consumption. In particular, we
  – binarize weight gradients owing to the lack of importance of their magnitudes,
  – modify the forward and backward batch normalization operations such that we remove the need to buffer high-precision activations and
  – determine and apply appropriate additional quantization schemes—power-of-two activation gradients and reduced-precision floating-point data—taken from the literature.
• Against Courbariaux & Bengio (2016)'s approach, we demonstrate the preservation of test accuracy and convergence rate when training BNNs to classify MNIST, CIFAR-10, SVHN and ImageNet while lowering memory and energy needs by up to 5.67× and 4.53×.
• We provide an open-source release of our training software, along with our memory and energy estimation tools, for the community to use and build upon (https://github.com/awai54st/Enabling-Binary-Neural-Network-Training-on-the-Edge).
Table 1. Comparison of applied approximations vs related low-cost neural network training works.
Work                    | Weights | Weight gradients | Activations | Activation gradients | Batch normalization
Zhou et al. (2016)      | int6¹   | int6¹            | int6¹       | int6¹                | ✗
Gruslys et al. (2016)   | ✗       | ✗                | Recomputed² | ✗                    | ✗
Ginsburg et al. (2017)  | float16 | float16          | float16     | float16              | ✗
Graham (2017)           | ✗       | ✗                | int         | ✗                    | ✗
Bernstein et al. (2018) | ✗       | bool             | ✗           | ✗                    | ✗
Wu et al. (2018b)       | ✗       | ✗                | ✗           | ✗                    | l1
This work               | bool    | bool             | bool        | po2_5³               | BNN-specific l1

¹ Arbitrary precision was supported, but significant accuracy degradation was observed below 6 bits.
² Activations were not retained between forward and backward propagation in order to save memory.
³ Power-of-two format comprising sign bit and exponent.
2. Related Work
The authors of all published works on BNN inference acceleration to date made use of high-precision floating-point data types during training (Courbariaux et al., 2015; Courbariaux & Bengio, 2016; Lin et al., 2017; Ghasemzadeh et al., 2018; Liu et al., 2018; Wang et al., 2019a; 2020; Umuroglu et al., 2020; He et al., 2020; Liu et al., 2020). There is precedent, however, for the use of quantization when training non-binary networks, as we show in Table 1 via side-by-side comparison of the approximation approaches taken in those works along with those herein.

The effects of quantizing the gradients of networks with high-precision data, either fixed or floating point, have been studied extensively. Zhou et al. (2016) and Wu et al. (2018a) trained networks with fixed-point weights and activations using fixed-point gradients, reporting no accuracy loss for AlexNet classifying ImageNet with gradients wider than five bits. Wen et al. (2017) and Bernstein et al. (2018) focused solely on aggressive weight gradient quantization, aiming to reduce communication costs for distributed learning. Weight gradients were quantized into ternary and binary formats, respectively, with forward propagation and activation gradients kept at high precision. In this work, we make the novel observations that activation gradient dynamic range is more important than precision, and that BNNs are more robust to approximation than higher-precision networks. We thus propose a data representation scheme more aggressive than all of the aforementioned works combined, delivering large memory and energy savings with near-lossless performance.

Gradient checkpointing—the recomputation of activations during backward propagation—has been proposed as a method to reduce the memory consumption of training (Chen et al., 2016; Gruslys et al., 2016). Such methods introduce additional forward passes, however, and so increase each run's duration and energy cost. Graham (2017) and Chakrabarti & Moseley (2019) saved memory during training by buffering activations in low-precision formats, achieving comparable accuracy to all-float32 baselines. Wu et al. (2018b) and Hoffer et al. (2018) reported reduced computational costs via l1 batch normalization. Finally, Helwegen et al. (2019) asserted that the use of both trainable weights and momenta is superfluous in BNN optimizers, proposing a weightless BNN-specific optimizer, Bop, able to reach the same level of accuracy as Adam. We took inspiration from these works in locating sources of redundancy present in standard BNN training schemes, and propose BNN-specific modifications to l1 batch normalization allowing for activation quantization all the way down to binary, thus saving memory and energy without increasing latency.
Algorithm 1. Standard BNN training step.
 1: for l ← 1, …, L−1 do                                     // Forward
 2:     X̂_l ← sgn(X_l)
 3:     Ŵ_l ← sgn(W_l)
 4:     Y_l ← X̂_l Ŵ_l
 5:     for m ← 1, …, M_l do
 6:         x^(m)_{l+1} ← (y^(m)_l − µ(y^(m)_l)) / σ(y^(m)_l) + β^(m)_l
 7: for l ← L−1, …, 1 do                                     // Backward
 8:     for m ← 1, …, M_l do
 9:         v ← ∂x^(m)_{l+1} / σ(y^(m)_l)
10:         ∂y^(m)_l ← v − µ(v) − µ(v ⊙ x^(m)_{l+1}) x^(m)_{l+1}
11:         ∂β^(m)_l ← Σ ∂x^(m)_{l+1}
12:     ∂X_l ← ∂Y_l Ŵ_lᵀ
13:     ∂W_l ← X̂_lᵀ ∂Y_l
14: for l ← 1, …, L−1 do                                     // Update
15:     W_l ← Optimize(W_l, ∂W_l, η)
16:     β_l ← Optimize(β_l, ∂β_l, η)
17: η ← LearningRateSchedule(η)
Algorithm 2. Proposed BNN training step.
 1: for l ← 1, …, L−1 do                                     // Forward
 2:     X̂_l ← sgn(X_l)
 3:     Ŵ_l ← sgn(W_l)
 4:     Y_l ← X̂_l Ŵ_l
 5:     for m ← 1, …, M_l do
 6:         x^(m)_{l+1} ← (y^(m)_l − µ(y^(m)_l)) / (‖y^(m)_l − µ(y^(m)_l)‖₁ / B) + β^(m)_l
 7: for l ← L−1, …, 1 do                                     // Backward
 8:     for m ← 1, …, M_l do
 9:         v ← ∂x^(m)_{l+1} / (‖y^(m)_l − µ(y^(m)_l)‖₁ / B)
10:         ∂y^(m)_l ← v − µ(v) − µ(v ⊙ x̂^(m)_{l+1} ‖x^(m)_{l+1}‖₁ / B) x̂^(m)_{l+1}
11:         ∂β^(m)_l ← Σ ∂x^(m)_{l+1}
12:     ∂Ỹ_l ← po2_5(∂Y_l)
13:     ∂X_l ← ∂Ỹ_l Ŵ_lᵀ
14:     ∂W_l ← X̂_lᵀ ∂Ỹ_l
15:     ∂Ŵ_l ← sgn(∂W_l)
16: for l ← 1, …, L−1 do                                     // Update
17:     W_l ← Optimize(W_l, ∂Ŵ_l / √M_{l−1}, η)
18:     β_l ← Optimize(β_l, ∂β_l, η)
19: η ← LearningRateSchedule(η)
3. Standard Training Flow

Figure 1.
Standard BNN training graph for fully connected layer l. "sgn," "×" and "BN" are sign, matrix multiplication and batch normalization operations. Forward propagation dependencies are shown in black; those for backward passes are gray.

For simplicity, we assume the use of a multi-layer perceptron (MLP), although the presence of convolutional layers would not change any of the principles that follow. Let W_l and X_l denote matrices of weights and activations, respectively, in the network's lth layer, with ∂W_l and ∂X_l being their gradients. For W_l, rows and columns span input and output channels, respectively, while for X_l they represent feature maps and channels. Henceforth, we use decoration to indicate low-precision data representation, with •̂ and •̃ respectively denoting binary and power-of-two encoding.

Figure 1 shows the training graph of a fully connected binary layer. A detailed description of the standard BNN training procedure introduced by Courbariaux & Bengio (2016) for each batch of B training samples, which we henceforth refer to as a step, is provided in Algorithm 1. Therein, "⊙" signifies element-wise multiplication. For brevity, we omit some of the intricacies of the baseline implementation—lack of first-layer quantization, use of a final softmax layer and the inclusion of weight gradient cancellation (Courbariaux & Bengio, 2016)—as these standard BNN practices are not impacted by our work. We initialize weights as outlined by Glorot & Bengio (2010).

Many authors have found that BNNs require batch normalization in order to avoid gradient explosion (Alizadeh et al., 2018; Sari et al., 2019; Qin et al., 2020), and our early experiments confirmed this to indeed be the case. We thus apply it as standard.
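For readers reproducing the baseline, the sign operations in Algorithm 1, together with the straight-through estimation and weight gradient cancellation mentioned above, can be realized as a custom-gradient operator. The following TensorFlow sketch is our own illustration of that standard practice, not the authors' released code.

```python
import tensorflow as tf

@tf.custom_gradient
def binarize_ste(x):
    """sgn(x) in {-1, +1}; the backward pass uses the straight-through estimator."""
    y = tf.where(x >= 0, tf.ones_like(x), -tf.ones_like(x))

    def grad(dy):
        # Weight gradient cancellation: propagate gradients only where |x| <= 1.
        return tf.where(tf.abs(x) <= 1.0, dy, tf.zeros_like(dy))

    return y, grad
```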
Table 2. Memory-related properties of variables used during training. In order to obtain the exemplary quantities of memory given, BinaryNet was trained for CIFAR-10 classification with Adam.
Variable | Per-layer lifetime | Std. data type | Std. size (MiB) | %     | Prop. data type | Prop. size (MiB) | Saving (×)
X        | ✗                  | float32        | 111.33          | 26.4  | bool            | 3.48             | 32.0
∂X, Y    | ✓                  | float32        | 48.00           | 11.4  | float16         | 24.00            | 2.00
µ(y_l)   | ✗                  | float32        | 0.01            | 0.0   | float16         | 0.01             | 2.00
σ(y_l)   | ✗                  | float32        | 0.01            | 0.0   | float16         | 0.01             | 2.00
∂Y       | ✓                  | float32        | 48.00           | 11.4  | po2_5           | 7.81             | 6.15
W        | ✗                  | float32        | 53.49           | 12.7  | float16         | 26.74            | 2.00
∂W       | ✗                  | float32        | 53.49           | 12.7  | bool            | 1.67             | 32.0
β        | ✗                  | float32        | 0.01            | 0.0   | float16         | 0.01             | 2.00
∂β       | ✗                  | float32        | 0.01            | 0.0   | float16         | 0.01             | 2.00
Momenta  | ✗                  | float32        | 106.98          | 25.4  | float16         | 53.49            | 2.00
Total    |                    |                | 421.33          | 100.0 |                 | 117.23           | 3.59
A check mark (✓) denotes variables that need not be retained between the forward, backward or update phases of Algorithms 1 and 2. ∂X and Y can share memory since they are equally sized and have non-overlapping lifetimes.

Matrix products Y_l are channel-wise batch-normalized across each layer's M_l output channels (Algorithm 1, lines 5–6) to form the subsequent layer's inputs, X_{l+1}. β constitutes the batch normalization biases. Layer-wise moving means µ(y_l) and standard deviations σ(y_l) are retained for use during backward propagation and inference. We forgo trainable scaling factors, commonly denoted γ; these are of irrelevance to BNNs since their activations are binarized during forward propagation (line 2).
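As a concrete restatement of the forward pass just described (Algorithm 1, lines 2–6), the NumPy sketch below evaluates one fully connected binary layer with γ-free, channel-wise batch normalization; the function name and the epsilon guard are our own assumptions rather than the reference implementation.

```python
import numpy as np

def binary_fc_forward(X, W, beta, eps=1e-5):
    """One fully connected binary layer: binarize, multiply, batch-normalize.

    X: (B, N) activations, W: (N, M) weights, beta: (M,) batch norm biases.
    """
    X_bin = np.where(X >= 0, 1.0, -1.0)      # \hat{X}_l = sgn(X_l)
    W_bin = np.where(W >= 0, 1.0, -1.0)      # \hat{W}_l = sgn(W_l)
    Y = X_bin @ W_bin                        # Y_l
    mu = Y.mean(axis=0)                      # channel-wise batch mean
    sigma = Y.std(axis=0)                    # channel-wise standard deviation
    return (Y - mu) / (sigma + eps) + beta   # X_{l+1}; no trainable scale (gamma)
```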
4. Variable Analysis
In order to quantify the potential gains from approximation, we conducted a variable representation and lifetime analysis of Algorithm 1 following the approach taken by Sohoni et al. (2019). Table 2 lists the properties of all variables in Algorithm 1, with each variable's contributions to the total footprint shown for a representative example. Variables are divided into two classes: those that must remain in memory between computational phases (forward propagation, backward propagation and weight update), and those that need not. This is of pertinence since, for those in the latter category, only the largest layer's contribution counts towards the total memory occupancy. For example, ∂X_l is read during the backward propagation of layer l−1 only, thus ∂X_{l−1} can safely overwrite ∂X_l for efficiency. Additionally, Y and ∂X are shown together since they are equally sized and only need to reside in memory during the forward and backward pass for each layer, respectively.
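The two lifetime classes translate into a simple accounting rule: retained variables contribute their sizes summed over all layers, whereas transient ones contribute only the largest single-layer buffer. The sketch below illustrates this rule with made-up layer sizes; it is not the released estimation tool.

```python
def footprint_mib(elems_per_layer, bits_per_elem, retained):
    """Memory footprint in MiB: sum over layers if retained, else max over layers."""
    sizes = [n * bits_per_elem / 8 / 2**20 for n in elems_per_layer]
    return sum(sizes) if retained else max(sizes)

# Hypothetical activation element counts for three layers at batch size 100.
acts = [100 * 8192, 100 * 4096, 100 * 2048]
print(footprint_mib(acts, 32, retained=True))   # float32 X buffered for backprop
print(footprint_mib(acts, 1, retained=True))    # binary X kept instead
print(footprint_mib(acts, 32, retained=False))  # transient dX: largest layer only
```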
5. Low-Cost BNN Training
As shown in Table 2, all variables within the standard BNN training flow use float32 representation. In the subsections that follow, we detail the application of aggressive approximation specifically tailored to BNN training. Further to this, and in line with the observation by many authors that float16 can be used for ImageNet training without inducing accuracy loss (Ginsburg et al., 2017; Wang et al., 2018; Micikevicius et al., 2018), we also switch all remaining variables to this format. Our final training procedure is captured in Algorithm 2, with modifications from Algorithm 1 in red and the corresponding data representations shown in bold in Table 2. We provide both theoretical evidence and training curves for all of our experiments in Appendix A to show that our proposed approximations do not cause material degradation to convergence rates.
Binary weight gradients. Intuitively, BNNs should be particularly robust to weight gradient binarization since their weights only constitute signs. On line 15 of Algorithm 2, therefore, we quantize and store weight gradients in binary format, ∂Ŵ, for use during weight update. During the latter, we attenuate the gradients by √N_l, where N_l is layer l's fan-in, to reduce the learning rate and prevent premature weight clipping as advised by Sari et al. (2019). Since fully connected layers are used as an example in Algorithm 2, N_l = M_{l−1} in this instance.

Table 2 shows that, with binarization, the portion of our exemplary training run's memory consumption attributable to weight gradients dropped from 53.49 to just 1.67 MiB, leaving the scarce resources available for more quantization-sensitive variables such as W and momenta. Energy consumption will also decrease due to the associated reduction in memory traffic.
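A minimal NumPy sketch of this weight gradient binarization and the √N_l attenuation applied at update time is given below; plain latent-weight descent with clipping is shown purely for illustration, whereas the experiments in Section 6 use Adam, SGD with momentum and Bop.

```python
import numpy as np

def update_binary_layer(W, dW, lr, fan_in):
    """Update latent weights with a binarized, fan-in-attenuated gradient."""
    dW_bin = np.where(dW >= 0, 1.0, -1.0)   # binary weight gradient, sgn(dW)
    W = W - lr * dW_bin / np.sqrt(fan_in)   # attenuate by sqrt(N_l)
    return np.clip(W, -1.0, 1.0)            # keep latent weights in [-1, 1]
```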
Power-of-two activation gradients. The tolerance of BNN training to weight gradient binarization further suggests that activation gradients can be aggressively approximated without causing significant accuracy loss. Unlike previous proposals, in which activation gradients were quantized into fixed- or block floating-point formats (Zhou et al., 2016; Wu et al., 2018a), we hypothesize that power-of-two representation is more suitable due to their typically high inter-channel variance.

We define power-of-two quantization into k-bit "po2_k" format as po2_k(•) = sgn(•) ⊙ 2^(exp(•) − b), comprising a sign bit and a (k−1)-bit exponent exp(•) = max(−2^(k−1), [log2(•) + b]) with bias b = 2^(k−1) − 1 − [log2(‖•‖∞)]. Square brackets signify rounding to the nearest integer. With b, we scale exp(•) layer-wise to make efficient use of its dynamic range. This is applied to quantize matrix product gradients ∂Y_l on line 12 of Algorithm 2. We chose to use k = 5 as standard, generally finding this value to result in high compression while inducing little loss in accuracy. While we elected not to similarly approximate ∂X due to its use in the computation of quantization-sensitive β, our use of ∂Ỹ = po2_5(∂Y) nevertheless leads to sizeable reductions in total memory footprint. Our use of ∂Ỹ further allows us to reduce the energy consumption associated with lines 13–14 of Algorithm 2, for both of which we now have one binary and one power-of-two operand. Assuming that the target training platform has native support for only 32-bit fixed- and floating-point arithmetic, these matrix multiplications can be computed by (i) converting powers of two into int32s via shifts, (ii) performing sign-flips and (iii) accumulating the int32 outputs. This consumes far less energy than standard training's all-float32 equivalent.
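Under the definition above, po2_k quantization can be prototyped in a few lines of NumPy; the guard against zero-valued inputs is our own assumption.

```python
import numpy as np

def po2_quantize(x, k=5):
    """Quantize x to sgn(x) * 2^(exp - b), with a (k-1)-bit exponent biased layer-wise."""
    sign = np.where(x >= 0.0, 1.0, -1.0)
    mag = np.maximum(np.abs(x), 1e-38)                       # avoid log2(0)
    b = 2 ** (k - 1) - 1 - np.round(np.log2(np.max(mag)))    # bias from the layer's max
    exp = np.maximum(-2.0 ** (k - 1), np.round(np.log2(mag) + b))
    return sign * 2.0 ** (exp - b)
```

Because every quantized magnitude is an exact power of two, the multiplications on lines 13–14 of Algorithm 2 reduce to the shifts and sign-flips described above.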
Batch normalization with binary activations. Analysis of the backward pass of Algorithm 1 reveals conflicting requirements for the precision of X. When computing weight gradients ∂W (line 13), only binary activations X̂ are needed. For the batch normalization training (lines 8–11), however, high-precision X is used. As was shown in Table 2, the storage of X between forward and backward propagation constitutes the single largest portion of the algorithm's total memory. If we are able to use X̂ in place of X for these operations, there will be no need to retain the high-precision activations, significantly reducing memory footprint as a result. We achieve this goal as follows.

Step 1: l1 normalization. Standard batch normalization sees channel-wise l2 normalization performed on each layer's centralized activations. Since batch normalization is immediately followed by binarization in BNNs, however, we argue that less-costly l1 normalization is good enough in this circumstance. Replacement of batch normalization's backward propagation operation with our l1 norm-based version sees lines 9–10 of Algorithm 1 swapped with

    v ← ∂x^(m)_{l+1} / (‖y^(m)_l − µ(y^(m)_l)‖₁ / B)
    ∂y^(m)_l ← v − µ(v) − µ(v ⊙ x^(m)_{l+1}) x̂^(m)_{l+1},                        (1)

where B is the batch size. Not only does our use of l1 batch normalization eliminate all squares and square roots, it also transforms one occurrence of x^(m)_{l+1} into its binary form.

Step 2: x^(m)_{l+1} approximation. Since ∂Y is quantized into our power-of-two format immediately after its calculation (Algorithm 2, line 12), we hypothesize that it should be robust to approximation. Consequently, we replace the x^(m)_{l+1} term remaining in (1) with the product of its signs and mean magnitude: x̂^(m)_{l+1} ‖x^(m)_{l+1}‖₁ / B.

Our complete batch normalization training functions are shown on lines 8–11 of Algorithm 2, which only require the storage of binary X̂ along with layer- and channel-wise scalars. With elements of X now binarized, we not only reduce its memory cost by 32× but also save energy thanks to the corresponding memory traffic reduction.
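Combining Steps 1 and 2, one channel of the modified batch normalization can be sketched as follows; only the binary activations and a handful of scalars are carried from the forward to the backward pass. This is our own restatement of lines 5–6 and 8–11 of Algorithm 2, not the reference implementation.

```python
import numpy as np

def l1_bn_forward(y, beta):
    """Forward: l1-normalize one channel of Y_l and return what backprop needs."""
    B = y.shape[0]
    mu = y.mean()
    s = np.abs(y - mu).sum() / B             # l1 scale, replacing sigma
    x = (y - mu) / s + beta                  # x_{l+1}
    x_bin = np.where(x >= 0, 1.0, -1.0)      # only this, s and the mean magnitude persist
    return x, x_bin, s, np.abs(x).sum() / B

def l1_bn_backward(dx, x_bin, s, x_mean_mag):
    """Backward: Algorithm 2, lines 9-11, using binary activations throughout."""
    v = dx / s
    dy = v - v.mean() - (v * x_bin * x_mean_mag).mean() * x_bin
    dbeta = dx.sum()
    return dy, dbeta
```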
6. Evaluation
We implemented our BNN training method using Keras and TensorFlow, and experimented with the small-scale MNIST, CIFAR-10 and SVHN datasets, as well as large-scale ImageNet, using a range of network models. Our baseline for comparison was the standard BNN training method introduced by Courbariaux & Bengio (2016), and we followed those authors' practice of reporting the highest test accuracy achieved in each run. Energy consumption results were obtained using the inference energy estimator from QKeras (Coelho Jr. et al., 2020), which we extended to also estimate the energy consumption of training. This tool assumes the use of an in-order processor fabricated on a 45 nm process and a cacheless memory hierarchy, as modeled by Horowitz (2014), resulting in high-level, platform-agnostic energy estimates useful for relative comparison. Note that we did not tune hyperparameters, thus it is likely that higher accuracy than we report is achievable.

For MNIST we evaluated using a five-layer MLP—henceforth simply denoted "MLP"—with 256 neurons per hidden layer, and CNV (Umuroglu et al., 2017) and BinaryNet (Courbariaux & Bengio, 2016) for both CIFAR-10 and SVHN. We used three popular BNN optimizers: Adam (Kingma & Ba, 2015), stochastic gradient descent (SGD) with momentum and Bop (Helwegen et al., 2019).
Table 3.
Test accuracy of non-binary and binary networks using standard and proposed training approaches with Adam. Results for our training approach applied to the former are included for reference only; we do not advocate for its use with non-binary networks.
Model     | Dataset  | Standard NN¹ (%) | Standard BNN (%) | ∆ (pp) | Reference NN (%) | ∆ (pp)² | Proposed BNN (%) | ∆ (pp)³
MLP       | MNIST    | 98.22            | 98.24            | 0.02   | –                | –       | –                | –
CNV       | CIFAR-10 | –                | 82.67            | –      | –                | –       | –                | –
CNV       | SVHN     | 97.30            | 96.37            | −0.93  | –                | –       | –                | –
BinaryNet | CIFAR-10 | 88.20            | 89.81            | 1.61   | –                | –       | –                | –
BinaryNet | SVHN     | 96.54            | 97.40            | 0.86   | –                | –       | –                | –

¹ Non-binary neural network.
² Baseline: non-binary network with standard training.
³ Baseline: BNN with standard training.
Table 4.
Memory footprint and per-batch energy consumption of the standard and our proposed training schemes using the Adam optimizer.
Model     | Std. memory (MiB) | Prop. memory (MiB) | Saving (×) | Std. energy/batch (mJ) | Prop. energy/batch (mJ) | Saving (×)
MLP       | –                 | –                  | –          | –                      | –                       | –
CNV       | –                 | –                  | –          | –                      | –                       | –
BinaryNet | –                 | –                  | –          | –                      | –                       | –
While all three function reliably with our training scheme, we used Adam by default due to its stability. Experimental setup minutiae can be found in Appendix B.1.

Our choice of quantization targets primarily rested on the intuition that BNNs should be more robust to approximation in backward propagation than their higher-precision counterparts. To illustrate that this is indeed the case, we compared our method's loss when applied to BNNs vs float32 networks with identical topologies and hyperparameters. Generally, per Table 3, significantly higher accuracy degradation was observed for the non-binary networks, as expected. While our proposed BNN training method does exhibit limited accuracy degradation—a geomean drop of 1.21 percentage points (pp) for these examples—this comes in return for simultaneous geomean memory and energy savings of 3.66× and 3.09×, respectively, as shown in Table 4. It is also interesting to note that the training cost reductions achievable for a given dataset depend on the model chosen to classify it, as can be seen across Tables 3 and 4. This observation is largely orthogonal to our work: by applying our approach to the training of a more appropriately chosen model, one can obtain the advantages of both optimized network selection and training, effectively benefiting twice.

In order to explore the impacts of the various facets of our scheme, we applied them sequentially while training BinaryNet to classify CIFAR-10 with multiple optimizers. As shown in Table 5, choices of data types, optimizer and batch normalization implementation lead to clear tradeoffs against performance and resource costs. Major memory savings are attributable to the use of float16 variables and to our l1 norm-based batch normalization. The bulk of our scheme's energy savings come from the power-of-two representation of ∂Y, which eliminates floating-point operations from lines 13–14 of Algorithm 2. We also evaluated the quantization of ∂Y into five-bit layer-wise block floating-point format, denoted "int5" in Table 5 since the individual elements are fixed-point values. With this encoding, significantly higher accuracy loss was observed than when ∂Y was quantized into the proposed, equally sized power-of-two format, confirming that representation of this variable's range is more important than its precision.

Figure 2 shows the memory footprint savings from our proposed BNN training method for different optimizers and batch sizes, again for BinaryNet with the CIFAR-10 dataset. Across all of these, we achieved a geomean reduction of 4.86×. Also observable from Figure 2 is that, for all three optimizers, movement from the standard to our proposed BNN training allows the batch size used to increase by 10×, facilitating faster completion, without a material increase in memory consumption. With respect to energy, we saw an estimated geomean 4.49× reduction, split into contributions attributable to arithmetic operations and memory traffic by 18.27× and 1.71×. Figure 2 also shows that test accuracy does not drop significantly due to our approximations. With Adam, there were small drops (geomean 0.87 pp), while with SGD and Bop we actually saw modest improvements.
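For completeness, the five-bit layer-wise block floating-point baseline referred to as "int5" above can be sketched as follows: all elements share one layer-wise exponent and keep a signed fixed-point mantissa, trading dynamic range for precision relative to po2_5. The rounding and clipping choices below are our own assumptions.

```python
import numpy as np

def int5_block_quantize(x, bits=5):
    """Layer-wise block floating point: shared exponent, per-element signed mantissa."""
    shared_exp = np.ceil(np.log2(np.max(np.abs(x)) + 1e-38))
    lsb = 2.0 ** (shared_exp - (bits - 1))     # weight of one mantissa step
    mantissa = np.clip(np.round(x / lsb), -2 ** (bits - 1), 2 ** (bits - 1) - 1)
    return mantissa * lsb
```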
Table 5. Accuracy, memory and energy impacts of moving from standard to our proposed data representations with BinaryNet trained to classify CIFAR-10. We include block floating-point ∂Y to illustrate the importance of dynamic range over precision for its representation.

Optimizer         | ∂W      | ∂Y      | Batch norm. | Top-1 test accuracy (%) | ∆ (pp)¹ | Memory saving (×) | Energy saving (×)
Adam              | float32 | float32 | l2          | –                       | –       | –                 | –
                  | float16 | float16 | l2          | –                       | –       | 2.00              | –
                  | bool    | float16 | l2          | –                       | –       | 2.27              | –
                  | bool    | int5    | l2          | –                       | –       | 2.50              | –
                  | bool    | po2_5   | l2          | –                       | 0.73    | 2.50              | –
                  | bool    | po2_5   | l1          | –                       | –       | 2.50              | –
                  | bool    | po2_5   | Proposed    | –                       | –       | 3.60              | –
SGD with momentum | float32 | float32 | l2          | –                       | –       | –                 | –
                  | float16 | float16 | l2          | –                       | 0.02    | 2.00              | –
                  | bool    | float16 | l2          | –                       | –       | 2.31              | –
                  | bool    | int5    | l2          | –                       | –       | 2.59              | –
                  | bool    | po2_5   | l2          | –                       | 0.56    | 2.59              | –
                  | bool    | po2_5   | l1          | –                       | 0.17    | 2.59              | –
                  | bool    | po2_5   | Proposed    | –                       | –       | 4.07              | –
Bop               | float32 | float32 | l2          | –                       | –       | –                 | –
                  | float16 | float16 | l2          | –                       | –       | 2.00              | –
                  | bool    | float16 | l2          | –                       | –       | 2.37              | –
                  | bool    | int5    | l2          | –                       | –       | 2.72              | –
                  | bool    | po2_5   | l2          | –                       | –       | 2.72              | –
                  | bool    | po2_5   | l1          | –                       | –       | 2.72              | –
                  | bool    | po2_5   | Proposed    | –                       | –       | 4.92              | –
92 4 . Baseline: float32 ∂ W and ∂ X with standard ( l ) batch normalization. Table 6.
Table 6. Test accuracy, memory footprint and per-batch energy consumption of the standard and our proposed training schemes for ResNetE-18 classifying ImageNet with Adam used for optimization.
Approximations            | Top-1 test accuracy (%) | ∆ (pp)¹ | Memory (GiB) | Saving (×) | Energy/batch (J) | Saving (×)
None                      | –                       | –       | –            | –          | –                | –
All-bfloat16              | –                       | 0.02    | 29.32        | 1.97       | 162.41           | –
bool ∂W only              | –                       | –       | 57.80        | 1.00       | 185.08           | 1.00
po2_8 ∂Y only             | –                       | –       | 57.84        | 1.00       | 116.06           | –
l1 batch norm. only       | –                       | –       | 57.84        | 1.00       | 185.08           | 1.00
Proposed batch norm. only | –                       | –       | 35.59        | 1.63       | 176.87           | –
Final combination²        | –                       | −2.25   | 18.54        | 3.12       | 158.44           | 1.17
¹ Baseline: approximation-free training.
² bool ∂W and bfloat16 remaining variables with proposed batch normalization.

We trained ResNetE-18, a mixed-precision model with most convolutional layers binarized (Bethge et al., 2019), to classify ImageNet. ResNetE-18 represents an exemplary instance within a broad class of ImageNet-capable networks, and we believe that similar results should be achievable for models with which it shares architectural features. Setup specifics can be found in Appendix B.2.

We show the performance of this network and dataset when applying each of our proposed approximations in turn, as well as with the combination we found to work best, in Table 6.
[Panels for Adam, SGD with momentum and Bop at batch sizes 100, 200, 400, 500 and 1000, showing memory (standard and proposed), test accuracy (standard and proposed) and energy split into operation- and memory-related components (standard and proposed).]
Figure 2.
Batch size vs training memory footprint, achieved test accuracy and per-batch training energy consumption for BinaryNet with CIFAR-10. The upper plots show memory and accuracy results for the standard and our proposed training flows. In the lower plots, total energy is split into compute- and memory-related components. Annotations show reductions vs the standard approach.

Since the Tensor Processing Units we used here natively support bfloat16 rather than float16, we switched to the former for these experiments. Where bfloat16 variables were used, these were employed across all layers; the remaining approximations were applied only to binary layers. Despite increasing the precision of our power-of-two quantized ∂Y by moving from k = 5 to 8, this scheme unfortunately induced significant accuracy degradation, suggesting incompatibility with large-scale datasets. Consequently, we disapplied it for our final experiment, which saw our remaining three approximations deliver memory and energy reductions of 3.12× and 1.17× in return for a 2.25 pp drop in test accuracy. While these savings are smaller than those of our small-scale experiments, we note that ResNetE-18's first convolutional layer is both its largest and is non-binary, thus its activation storage dwarfs that of the remaining layers. We also remark that, while ∂W was still binarized, the resulting savings are insignificant here since ImageNet's large images result in proportionally small weight memory occupancy. Nevertheless, this proof of concept demonstrates the feasibility of large-scale neural network training on the edge.
7. Conclusion
In this paper, we introduced the first training scheme tailored specifically to BNNs. Moving first to 16-bit floating-point representations, we selectively and opportunistically approximated beyond this based on careful analysis of the standard training algorithm presented by Courbariaux & Bengio. With a comprehensive evaluation conducted across multiple models, datasets, optimizers and batch sizes, we showed the generality of our approach and reported significant memory and energy reductions vs the prior art, challenging the notion that the resource constraints of edge platforms present insurmountable barriers to on-device learning. In the future, we will explore the potential of our training approximations in the custom hardware setting, within which we expect there to be vast energy-saving potential through the design of tailor-made arithmetic operators.
Acknowledgments
The authors are grateful for the support of the United Kingdom EPSRC (grant numbers EP/P010040/1 and EP/S030069/1). They also wish to thank Sergey Ioffe and Michele Covell for their helpful suggestions.
References
Agarwal, N., Suresh, A. T., Yu, F., Kumar, S., and McMahan, H. B. cpSGD: Communication-efficient and differentially-private distributed SGD. In International Conference on Neural Information Processing Systems, 2018.
Alizadeh, M., Fernández-Marqués, J., Lane, N. D., and Gal, Y. An empirical study of binary neural networks' optimisation. In International Conference on Learning Representations, 2018.
Bernstein, J., Wang, Y.-X., Azizzadenesheli, K., and Anandkumar, A. signSGD: Compressed optimisation for non-convex problems. In International Conference on Machine Learning, 2018.
Bethge, J., Yang, H., Bornstein, M., and Meinel, C. Back to simplicity: How to train accurate BNNs from scratch? arXiv preprint arXiv:1906.08637, 2019.
Bonawitz, K., Eichner, H., Grieskamp, W., Huba, D., Ingerman, A., Ivanov, V., Kiddon, C., Konečný, J., Mazzocchi, S., McMahan, H. B., van Overveldt, T., Petrou, D., Ramage, D., and Roselander, J. Towards federated learning at scale: System design. In Conference on Machine Learning and Systems, 2019.
Cai, H., Gan, C., Zhu, L., and Han, S. Tiny Transfer Learning: Towards memory-efficient on-device learning. In IEEE Conference on Computer Vision and Pattern Recognition, 2020.
Chakrabarti, A. and Moseley, B. Backprop with approximate activations for memory-efficient network training. In Advances in Neural Information Processing Systems, 2019.
Chen, T., Xu, B., Zhang, C., and Guestrin, C. Training deep nets with sublinear memory cost. arXiv preprint arXiv:1604.06174, 2016.
Coelho Jr., C. N., Kuusela, A., Zhuang, H., Aarrestad, T., Loncar, V., Ngadiuba, J., Pierini, M., and Summers, S. Ultra low-latency, low-area inference accelerators using heterogeneous deep quantization with QKeras and hls4ml. arXiv preprint arXiv:2006.10159, 2020.
Constantinides, G. A. Rethinking arithmetic for deep neural networks. Philosophical Transactions of the Royal Society A, 378(2166), 2019.
Courbariaux, M. and Bengio, Y. BinaryNet: Training deep neural networks with weights and activations constrained to +1 or -1. arXiv preprint arXiv:1602.02830, 2016.
Courbariaux, M., Bengio, Y., and David, J.-P. BinaryConnect: Training deep neural networks with binary weights during propagations. In Conference on Neural Information Processing Systems, 2015.
Ghasemzadeh, M., Samragh, M., and Koushanfar, F. ReBNet: Residual binarized neural network. In IEEE International Symposium on Field-Programmable Custom Computing Machines, 2018.
Ginsburg, B., Nikolaev, S., and Micikevicius, P. Training of deep networks with half-precision float. In Nvidia GPU Technology Conference, 2017.
Glorot, X. and Bengio, Y. Understanding the difficulty of training deep feedforward neural networks. In International Conference on Artificial Intelligence and Statistics, 2010.
Graham, B. Low-precision batch-normalized activations. arXiv preprint arXiv:1702.08231, 2017.
Gruslys, A., Munos, R., Danihelka, I., Lanctot, M., and Graves, A. Memory-efficient backpropagation through time. In Advances in Neural Information Processing Systems, 2016.
He, X., Mo, Z., Cheng, K., Xu, W., Hu, Q., Wang, P., Liu, Q., and Cheng, J. ProxyBNN: Learning binarized neural networks via proxy matrices. In European Conference on Computer Vision, 2020.
Helwegen, K., Widdicombe, J., Geiger, L., Liu, Z., Cheng, K.-T., and Nusselder, R. Latent weights do not exist: Rethinking binarized neural network optimization. In Advances in Neural Information Processing Systems, 2019.
Hoffer, E., Banner, R., Golan, I., and Soudry, D. Norm matters: Efficient and accurate normalization schemes in deep networks. In Advances in Neural Information Processing Systems, 2018.
Horowitz, M. Computing's energy problem (and what we can do about it). In International Solid-State Circuits Conference, 2014.
Kingma, D. P. and Ba, J. Adam: A method for stochastic optimization. In International Conference on Learning Representations, 2015.
Lin, X., Zhao, C., and Pan, W. Towards accurate binary convolutional neural network. In Conference on Neural Information Processing Systems, 2017.
Liu, Z., Wu, B., Luo, W., Yang, X., Liu, W., and Cheng, K.-T. Bi-Real Net: Enhancing the performance of 1-bit CNNs with improved representational capability and advanced training algorithm. In European Conference on Computer Vision, 2018.
Liu, Z., Shen, Z., Savvides, M., and Cheng, K.-T. ReActNet: Towards precise binary neural network with generalized activation functions. In European Conference on Computer Vision, 2020.
McMahan, B., Moore, E., Ramage, D., Hampson, S., and y Arcas, B. A. Communication-efficient learning of deep networks from decentralized data. In International Conference on Artificial Intelligence and Statistics, 2017.
Micikevicius, P., Narang, S., Alben, J., Diamos, G., Elsen, E., Garcia, D., Ginsburg, B., Houston, M., Kuchaiev, O., Venkatesh, G., and Wu, H. Mixed precision training. In International Conference on Learning Representations, 2018.
Qin, H., Gong, R., Liu, X., Bai, X., Song, J., and Sebe, N. Binary neural networks: A survey. Pattern Recognition, 105, 2020.
Sari, E., Belbahri, M., and Nia, V. P. How does batch normalization help binary training? arXiv preprint arXiv:1909.09139, 2019.
Sohoni, N. S., Aberger, C. R., Leszczynski, M., Zhang, J., and Ré, C. Low-memory neural network training: A technical report. arXiv preprint arXiv:1904.10631, 2019.
Umuroglu, Y., Fraser, N. J., Gambardella, G., Blott, M., Leong, P. H. W., Jahre, M., and Vissers, K. FINN: A framework for fast, scalable binarized neural network inference. In ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, 2017.
Umuroglu, Y., Akhauri, Y., Fraser, N. J., and Blott, M. LogicNets: Co-designed neural networks and circuits for extreme-throughput applications. In International Conference on Field-Programmable Logic and Applications, 2020.
Wang, E., Davis, J. J., Cheung, P. Y. K., and Constantinides, G. A. LUTNet: Rethinking inference in FPGA soft logic. In IEEE International Symposium on Field-Programmable Custom Computing Machines, 2019a.
Wang, E., Davis, J. J., Zhao, R., Ng, H.-C., Niu, X., Luk, W., Cheung, P. Y. K., and Constantinides, G. A. Deep neural network approximation for custom hardware: Where we've been, where we're going. ACM Computing Surveys, 52(2), 2019b.
Wang, E., Davis, J. J., Cheung, P. Y. K., and Constantinides, G. A. LUTNet: Learning FPGA configurations for highly efficient neural network inference. IEEE Transactions on Computers, 2020.
Wang, N., Choi, J., Brand, D., Chen, C.-Y., and Gopalakrishnan, K. Training deep neural networks with 8-bit floating point numbers. In Advances in Neural Information Processing Systems, 2018.
Wen, W., Xu, C., Yan, F., Wu, C., Wang, Y., Chen, Y., and Li, H. TernGrad: Ternary gradients to reduce communication in distributed deep learning. In Advances in Neural Information Processing Systems, 2017.
Wilson, A. C., Roelofs, R., Stern, M., Srebro, N., and Recht, B. The marginal value of adaptive gradient methods in machine learning. In Advances in Neural Information Processing Systems, 2017.
Wu, S., Li, G., Chen, F., and Shi, L. Training and inference with integers in deep neural networks. In International Conference on Learning Representations, 2018a.
Wu, S., Li, G., Deng, L., Liu, L., Wu, D., Xie, Y., and Shi, L. L1-norm batch normalization for efficient training of deep neural networks. IEEE Transactions on Neural Networks and Learning Systems, 30(7), 2018b.
Zhou, S., Ni, Z., Zhou, X., Wen, H., Wu, Y., and Zou, Y. DoReFa-Net: Training low bitwidth convolutional neural networks with low bitwidth gradients. arXiv preprint arXiv:1606.06160, 2016.
A. Convergence Rate Analysis
A.1. Theoretical Support

[Plot of weight gradient density and weight noise density against epoch.]
Figure 3.
Weight density of the sixth convolutional layer of BinaryNet trained with bool weight and po2_5 activation gradients using Adam and the CIFAR-10 dataset.
Bernstein et al. (2018) proved that training non-binary networks with binary weight gradients may result in similar convergence rates to those of unquantized training if weight gradient density φ([µ(∂W_1), …, µ(∂W_S)]ᵀ) and weight noise density φ([σ(∂W_1), …, σ(∂W_S)]ᵀ) remain within an order of magnitude throughout a training run. Here, S is the number of training steps and φ(•) = ‖•‖₁² / (N‖•‖₂²) denotes the density function of an N-element vector.

We repeated Bernstein et al.'s evaluation with our proposed gradient quantization applied during BinaryNet training with the CIFAR-10 dataset using Adam and hyperparameters as detailed in Appendix B.1. The results of this experiment can be found in Figure 3. We chose to show the densities of BinaryNet's sixth convolutional layer since this is the largest layer in the network. Each batch of inputs was trained using quantized gradients ∂Ŵ and ∂Ỹ. The trained network was then evaluated using the same training data to obtain the float32 (unquantized) ∂W used to plot the data shown in Figure 3. We found that the weight gradient density ranged from 0.55–0.62, and weight noise density 0.92–0.97, therefore concluding that our quantization method may result in similar convergence rates to the unquantized baseline.

It should be noted that Bernstein et al.'s derivations assumed the use of smooth objective functions. Although the forward propagation of BNNs is not smooth due to binarization, their training functions still assume smoothness due to the use of straight-through estimation.
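For reference, the density measure defined above can be computed directly as below; the random data is a stand-in for the per-step gradient statistics, not values from the paper.

```python
import numpy as np

def density(v):
    """phi(v) = ||v||_1^2 / (N * ||v||_2^2) for an N-element vector v."""
    v = np.ravel(np.asarray(v, dtype=np.float64))
    return np.sum(np.abs(v)) ** 2 / (v.size * np.sum(v ** 2))

rng = np.random.default_rng(0)
print(density(rng.normal(size=10_000)))   # ~2/pi (about 0.64) for Gaussian data
```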
A.2. Empirical Support

Figures 4, 5, 6 and 7 contain the training accuracy curves of all experiments conducted for this work. The curves of the standard and our proposed training methods are broadly similar, supporting the conclusion from Appendix A.1 that our proposals do not induce significant convergence rate change.
B. Experimental Setup
B.1. Small-Scale Datasets
We used the development-based learning rate scheduling approach proposed by Wilson et al. (2017) with an initial learning rate η of 0.001 for all optimizers except for SGD with momentum, for which we used 0.1. We used batch size B = 100 for all except for Bop, for which we used B = 50 as recommended by Helwegen et al. (2019). MNIST and CIFAR-10 were trained for 1000 epochs; SVHN for 200.
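One way to realize development-based scheduling of this kind in Keras is with a plateau callback on a held-out metric; the monitor, decay factor and patience below are illustrative placeholders rather than the settings used in the paper.

```python
import tensorflow as tf

# Decay the learning rate when the development-set accuracy stops improving,
# in the spirit of Wilson et al. (2017).
dev_schedule = tf.keras.callbacks.ReduceLROnPlateau(
    monitor="val_accuracy",  # development-set metric
    factor=0.1,              # illustrative decay factor
    patience=10,             # illustrative patience, in epochs
)
# model.fit(..., validation_data=dev_data, callbacks=[dev_schedule])
```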
B.2. ImageNet

Finding development-based learning rate scheduling to not work well with ResNetE-18, we resorted to the fixed decay schedule described by Bethge et al. (2019). η began at 0.001 and decayed by a factor of 10 at epochs 70, 90 and 110. We trained for 120 epochs with B = 4096.
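The fixed schedule quoted above maps directly onto a Keras learning rate callback; this sketch simply encodes the stated decay points.

```python
import tensorflow as tf

def resnete18_schedule(epoch, lr=None):
    """eta = 0.001, decayed by 10x at epochs 70, 90 and 110 (Bethge et al., 2019)."""
    eta = 1e-3
    for boundary in (70, 90, 110):
        if epoch >= boundary:
            eta /= 10.0
    return eta

lr_callback = tf.keras.callbacks.LearningRateScheduler(resnete18_schedule)
# model.fit(..., epochs=120, callbacks=[lr_callback])
```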
[Five panels: MLP & MNIST, CNV & CIFAR-10, CNV & SVHN, BinaryNet & CIFAR-10 and BinaryNet & SVHN, each plotting top-1 training accuracy against epoch for the NN standard, BNN standard, NN reference and BNN proposed configurations.]
Figure 4.
Achieved training accuracy over time for experiments reported in Table 3.
[Three panels: Adam, SGD with momentum and Bop, each plotting top-1 training accuracy against epoch for the all-float32 l2, all-float16 l2, bool ∂W/float16 ∂Y l2, bool ∂W/int5 ∂Y l2, bool ∂W/po2_5 ∂Y l2, bool ∂W/po2_5 ∂Y l1 and bool ∂W/po2_5 ∂Y proposed configurations.]
Figure 5.
Achieved training accuracy over time for experiments reported in Table 5.

[Panels for Adam, SGD with momentum and Bop at batch sizes 100, 200, 400, 500 and 1000, plotting top-1 training accuracy against epoch for the standard and proposed training flows.]
Figure 6.