E2-Train: Training State-of-the-art CNNs with Over 80% Energy Savings
Yue Wang, Ziyu Jiang, Xiaohan Chen, Pengfei Xu, Yang Zhao, Yingyan Lin, Zhangyang Wang
Department of Computer Science and Engineering, Texas A&M University
Department of Electrical and Computer Engineering, Rice University
{jiangziyu, chernxh, atlaswang}@tamu.edu, {yw68, px5, zy34, yingyan.lin}@rice.edu
http://rtml.eiclab.net/publications/e2-train

Abstract
Convolutional neural networks (CNNs) have been increasingly deployed to edge devices. Hence, many efforts have been made towards efficient CNN inference on resource-constrained platforms. This paper attempts to explore an orthogonal direction: how to conduct more energy-efficient training of CNNs, so as to enable on-device training. We strive to reduce the energy cost during training by dropping unnecessary computations at three complementary levels: stochastic mini-batch dropping on the data level; selective layer update on the model level; and sign prediction for low-cost, low-precision back-propagation on the algorithm level. Extensive simulations and ablation studies, with real energy measurements from an FPGA board, confirm the superiority of our proposed strategies and demonstrate remarkable energy savings for training. For example, when training ResNet-74 on CIFAR-10, we achieve aggressive energy savings of >90% and >60%, while incurring a top-1 accuracy loss of only about 2% and 1.2%, respectively. When training ResNet-110 on CIFAR-100, an over 84% training energy saving is achieved without degrading inference accuracy.
The first three authors (Yue Wang, Ziyu Jiang, Xiaohan Chen) contributed equally. Correspondence should be addressed to: Yingyan Lin and Zhangyang Wang.
33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

1 Introduction

The increasing penetration of intelligent sensors has revolutionized how the Internet of Things (IoT) works. For visual data analytics, we have witnessed the record-breaking predictive performance achieved by convolutional neural networks (CNNs) [1, 2, 3]. Although such high-performance CNN models are initially learned in data centers and then deployed to IoT devices, there is an increasing necessity for the model to continue learning and updating itself in situ, such as for personalization to different users or for incremental/lifelong learning. Ideally, this learning/retraining process should take place on the device. Compared to cloud-based retraining, training locally helps avoid transferring data back and forth between data centers and IoT devices, reduces communication cost/latency, and enhances privacy.

However, training on IoT devices is non-trivial and far more time/resource-consuming, yet much less explored, than inference. IoT devices, such as smart phones and wearables, have limited computation and energy resources, which are stringent even for inference. Training CNNs consumes orders of magnitude more computation than a single inference. For example, training ResNet-50 on only one 224 × 224 image can take up to 12 GFLOPs (vs. 4 GFLOPs for inference), which can easily drain a mobile phone battery when training over batches of images [4]. This mismatch between the limited resources of IoT devices and the high complexity of CNNs is only getting worse, because network structures are becoming more complex as they are designed to solve harder and larger-scale tasks [5].

This paper considers the most standard CNN training setting, assuming both the model structure and the dataset to be given. This "basic" training setting is not usually the realistic IoT case, but we address it as a starting point (with familiar benchmarks) and an opening door towards a toolbox that may later be extended to online/transfer learning as well (see Section 5). Our goal is to reduce the total energy cost of training, which is complicated by a myriad of factors: from per-sample (mini-batch) complexity (both feed-forward and backward computations), to the empirical convergence rate (how many epochs it takes to converge), and, more broadly, hardware/architecture factors such as data access and movement [6, 7, 8]. Despite a handful of works on efficient, accelerated CNN training [9, 10, 11, 12, 13], they mostly focus on reducing the total training time in resource-rich settings, such as distributed training on large-scale GPU clusters. In contrast, our focus is to trim down the total energy cost of in-situ, resource-constrained training. This represents an orthogonal (and less studied) direction to [9, 10, 11, 12, 13, 14, 15], although the two can certainly be combined.

To unleash the potential of more energy-efficient in-situ training, we look closely at the full CNN training lifecycle. With the goal of "squeezing out" unnecessary costs, we raise three questions:

• Q1: Are all samples always required throughout training: is it necessary to use all training samples in all epochs?
• Q2: Are all parts of the entire model equally important during training: does every layer or filter have to be updated every time?
• Q3: Are precise gradients indispensable for training: can we efficiently compute and update the model with approximate gradients?

These three questions only represent our "first stab" at exploring energy-efficient training, whose full scope is much more profound. By no means do they cover all possible directions. We envision that many other recipes could be blended in as well, such as training with lower bit precision or input resolution [16, 17]. We also recognize that energy-efficient CNN training should be jointly considered with hardware/architecture co-design [18, 19], which is beyond the current work.

Motivated by the above questions, this paper proposes a novel energy-efficient CNN training framework dubbed E²-Train. It consists of three complementary efforts to trim down unnecessary training computations and data movements, each addressing one of the above questions:

• Data-Level: Stochastic mini-batch dropping (SMD).
We show that CNN training can be accelerated by a "frustratingly easy" strategy: randomly skipping mini-batches with probability 0.5 throughout training. This can be interpreted as data sampling with (limited) replacement, and is found to incur minimal accuracy loss (and sometimes even an accuracy increase).
• Model-Level: Input-dependent selective layer update (SLU).
For each mini-batch, we select a different subset of the CNN layers to be updated. The input-adaptive selection is based on a low-cost gating function jointly learned during training. While similar ideas were explored for efficient inference [20], this is the first time they are applied to, and evaluated for, training.
• Algorithm-Level: Predictive sign gradient descent (PSG).
We explore the use of an extremely low-precision gradient descent algorithm called SignSGD, which has recently found both theoretical and experimental grounding [21]. The original algorithm still requires the full gradient computation and therefore does not save energy. We create a novel "predictive" variant that can obtain the sign without computing the full gradient, via low-cost, bit-level prediction. Combined with a mixed-precision design, it decreases both computation and data-movement costs.

Besides these mainly experimental explorations, we find that E²-Train has many interesting links to recent CNN training theories, e.g., [22, 23, 24, 25]. We evaluate E²-Train in comparison with its closest state-of-the-art competitors. To measure its actual performance, E²-Train is implemented and evaluated on an FPGA board. The results show that CNN models trained with E²-Train consistently achieve higher training energy efficiency with marginal accuracy drops.
2 Related Work

Accelerated CNN training. A number of works have been devoted to accelerating training in a resource-rich setting, by utilizing communication-efficient distributed optimization and larger mini-batch sizes [9, 10, 11, 12]. The latest work [13] combined distributed training with a mixed-precision framework, leading to training AlexNet within 4 minutes. However, their goals and settings are distinct from ours: while distributed training strategies can reduce time, they actually incur more total energy overhead, and are clearly not applicable to on-device, resource-constrained training.
Low-precision training.
It is well known that CNN training can be performed at substantially lower precision [14, 15, 16], rather than using full-precision floats. Specifically, training with quantized gradients has been well studied in distributed learning, where the main motivation is to reduce the communication cost of gradient aggregation between workers [21, 26, 27, 28, 29, 30]. A few works considered transmitting only the coordinates with large magnitudes [31, 32, 33]. Recently, the SignSGD algorithm [21, 26] even showed the feasibility of using one-bit gradients (signs) during training, without notably hampering the convergence rate or the final result. However, most of these algorithms are optimized for distributed communication efficiency, rather than for reducing training energy costs. Many of them, including [21], need to first compute full-precision gradients and then quantize them.
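To make the referenced update rule concrete, the following is a minimal Python (PyTorch) sketch of a SignSGD step; the function name and loop are ours, and the full algorithm in [21] (e.g., with majority voting across distributed workers) involves more than this.

```python
import torch

def signsgd_step(parameters, lr):
    """One SignSGD update: move each weight by the sign of its gradient.

    The full-precision gradient is still produced by back-propagation;
    only the update itself is 1-bit, which is why SignSGD by itself does
    not reduce the training energy cost targeted in this paper.
    """
    with torch.no_grad():
        for p in parameters:
            if p.grad is not None:
                p.add_(torch.sign(p.grad), alpha=-lr)
```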
Efficient CNN inference: Static and Dynamic.
Compressing CNNs and speeding up their inference have attracted major research interest in recent years. Representative methods include weight pruning, weight sharing, layer factorization, and bit quantization, to name just a few [34, 35, 36, 37, 38]. While model compression presents "static" solutions for improving inference efficiency, a more interesting recent trend is dynamic inference [20, 39, 40, 41, 42, 43], which reduces latency by selectively executing subsets of layers in the network conditioned on each input. That sequential decision-making process is usually controlled by low-cost gating or policy networks. This mechanism has also been applied to improve inference energy efficiency [44, 45]. In [46], a unique bit-level prediction framework called
PredictiveNet was presented to accelerate CNN inference at a lower level. Since CNN layer-wise activations are usually highly sparse, the authors proposed to predict the zero locations using low-cost bit predictors, thereby bypassing a large fraction of energy-dominant convolutions without modifying the CNN structure.

Energy-efficient training is different from, and more complicated than, its inference counterpart. However, many insights gained from the latter can be lent to the former. For example, a recent work [47] showed that performing active channel pruning during training can accelerate the empirical convergence. Our proposed model-level SLU is inspired by [20]. The algorithm-level PSG also inherits the idea of bit-level low-cost prediction from [46].
3 The Proposed E²-Train Framework

Figure 1: An illustration of the proposed framework. SLU: each blue circle G indicates an RNN gate, and each blue square under G indicates one block of layers in the base model. Green arrows denote the backward propagation. The RNN gates generate strategies to select which layers to train for each input. In this specific example, the second and fourth blocks are "skipped" for both feed-forward and backward computations. Only the first and third blocks are updated. The details of SMD and PSG are described in the main text.

3.1 Data Level: Stochastic Mini-Batch Dropping (SMD)

We first adopt a straightforward, seemingly naive, yet surprisingly effective stochastic mini-batch dropping (SMD) strategy (see Fig. 1) that aggressively reduces the training cost by letting training see fewer mini-batches. At each epoch, SMD simply skips every mini-batch with a default probability of 0.5. All other training protocols, such as the learning rate schedule, remain unchanged. Compared to normal training, SMD directly halves the training cost if both are trained for the same number of epochs. Yet, amazingly, we observe in our experiments that SMD usually leads to a negligible accuracy decrease, and sometimes even an increase (see Sec. 4). Why? We discuss possible explanations below.

SMD can be interpreted as sampling with limited replacement. To understand this, think of combining two consecutive SMD-enforced epochs into one: it then has the same number of mini-batches as one full epoch, but within it each training sample now has a 0.25, 0.5, and 0.25 probability of being sampled 2, 1, and 0 times, respectively. The conventional wisdom is that for stochastic gradient descent (SGD), in each epoch, the mini-batches are sampled i.i.d. from the data without replacement (i.e., each sample occurs exactly once per epoch) [48, 49, 50, 51, 52]. However, [22] proved that sampling mini-batches with replacement has a larger variance than sampling without replacement, and consequently SGD may have better regularization properties.

Alternatively, SMD can also be viewed as a special form of data augmentation that injects more sampling noise to perturb the training distribution every epoch. Past works [53, 54, 55] have shown that specific kinds of random noise aid convergence by escaping from saddle points or less generalizable minima. The structured sampling noise caused by SMD might aid this exploration. Besides, [23, 56, 57] also showed that an importance sampling scheme that focuses training on more "informative" examples leads to faster convergence under resource budgets. They implied that the mini-batch dropping could be made selective, based on certain information criteria, instead of stochastic. We use SMD because it has zero overhead, but more effective dropping options might be available if low-cost indicators of mini-batch importance can be identified: we leave this as future work.
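As a concrete illustration, here is a minimal sketch of SMD inside an otherwise standard PyTorch training loop; `model`, `loader`, `criterion`, and `optimizer` are placeholders, and 0.5 is the default drop probability described above.

```python
import random

def train_one_epoch_with_smd(model, loader, criterion, optimizer, drop_prob=0.5):
    """One epoch with stochastic mini-batch dropping (SMD).

    Each mini-batch is skipped independently with probability `drop_prob`;
    everything else (learning-rate schedule, augmentation, etc.) is unchanged,
    so roughly half of the per-epoch computation is saved on average.
    """
    model.train()
    for inputs, targets in loader:
        if random.random() < drop_prob:
            continue  # skip: neither the forward nor the backward pass is run
        optimizer.zero_grad()
        loss = criterion(model(inputs), targets)
        loss.backward()
        optimizer.step()
```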
3.2 Model Level: Input-Dependent Selective Layer Update (SLU)

[20] proposed to dynamically skip a subset of layers for different inputs, in order to adaptively accelerate feed-forward inference. However, [20] calls for post-processing after supervised training, i.e., refining the dynamic skipping policy via reinforcement learning, thus causing undesired extra training overhead. We propose to extend the idea of dynamic inference to the training stage, i.e., dynamically skipping a subset of layers during both feed-forward and back-propagation. Crucially, we show that by adding an auxiliary regularization, such dynamic skipping can be learned from scratch and obtain satisfactory performance: neither post refinement nor extra training iterations are required. That is critical for dynamic layer skipping to be useful for energy-efficient training; we term this extended scheme input-dependent selective layer update (SLU).

As depicted in Fig. 1, given a base CNN to be trained, we follow [20] and add a light-weight RNN gating network per layer block. Each gate takes the same input as its corresponding layer and outputs a soft gating indicator in [0, 1], which is then used as the skipping probability, i.e., the higher the value, the more likely that layer will be selected. Therefore, each layer is adaptively selected or skipped depending on the input, and only the layers activated by the gates are updated. The RNN gates cost less than 0.04% of the feed-forward FLOPs of the base models; hence their energy overhead is negligible. More details can be found in the supplement.

[20] first trained the gates in a supervised way together with the base model. Observing that the learned routing policies were often not sufficiently efficient, they used reinforcement learning post-processing to learn more aggressive skipping afterwards. While this is fine for the end goal of dynamic inference, we hope to get rid of the post-processing overhead. To overcome this hurdle, we incorporate a computational complexity regularization into the objective function:

$$\min_{W, G} \; L(W, G) + \alpha\, C(W, G) \qquad (1)$$

Here, $\alpha$ is a weighting coefficient for the computational complexity regularization, and $W$ and $G$ denote the parameters of the base model and the gating network, respectively. $L(W, G)$ denotes the prediction loss, and $C(W, G)$ is calculated by accumulating the computational cost (FLOPs) of the selected layers. The regularization explicitly encourages learning more "parsimonious" selections throughout training. We find that such SLU-regularized training takes almost the same number of epochs to converge as standard training, i.e., SLU does not sacrifice empirical convergence speed. As a side effect, SLU naturally yields CNNs with dynamic inference capability. Though not the focus of this paper, we find that a CNN trained with SLU reaches an accuracy-efficiency trade-off comparable to one trained with the approach in [20].

The practice of SLU seems to align with several recent theories on CNN training. In [58], the authors suggested that "not all layers are created equal" for training. Specifically, some layers are critical and need to be intensively updated to improve final predictions, while others are insensitive throughout training. There exist "non-critical" layers that barely change their weights during training: even resetting those layers in a trained model to their initial values has few negative consequences. The more recent work [25] further confirmed this phenomenon, though how to identify those non-critical model parts at an early training stage remains unclear. [59, 60] also observed that different samples may activate different sub-models. These inspiring theories, combined with dynamic inference practice, motivate us to propose SLU for more efficient training.
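The sketch below illustrates the idea with a simplified gated residual block and the loss of Eq. (1), assuming PyTorch. The actual SLU gates are light-weight RNNs shared across blocks as in [20]; here a per-block linear gate on pooled features, a per-batch rather than per-input decision, and a hard 0.5 selection threshold stand in for them.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GatedBlock(nn.Module):
    """A residual block guarded by a tiny input-dependent gate (SLU sketch).

    The gate maps globally pooled features to a value in [0, 1] used as the
    selection indicator. A skipped block runs neither forward nor backward;
    the gate outputs, weighted by per-block cost, accumulate into the
    complexity term C(W, G) of Eq. (1).
    """

    def __init__(self, channels, block_cost=1.0):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
        )
        self.gate = nn.Linear(channels, 1)  # stand-in for the shared RNN gate
        self.block_cost = block_cost        # relative FLOPs of this block

    def forward(self, x):
        pooled = F.adaptive_avg_pool2d(x, 1).flatten(1)
        p = torch.sigmoid(self.gate(pooled)).mean()  # soft gating indicator
        cost = p * self.block_cost                   # differentiable cost term
        if p.item() < 0.5:   # not selected: the body is never executed
            return x, cost
        return F.relu(x + p * self.body(x)), cost

# Eq. (1): total loss = prediction loss + alpha * sum of per-block costs, e.g.
# loss = F.cross_entropy(logits, targets) + alpha * sum(costs)
# where `logits` and `costs` come from a hypothetical helper that runs all blocks.
```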
3.3 Algorithm Level: Predictive Sign Gradient Descent (PSG)

It is well recognized that low-precision fixed-point implementation is a very effective knob for achieving energy-efficient CNNs, because both the computational and the data-movement costs of CNNs are approximately a quadratic function of the employed precision. For example, a state-of-the-art design [61] shows that adopting 8-bit precision for a multiplication, an adder, and data movement can reduce energy cost by 95%, 97%, and 75%, respectively, as compared to a 32-bit floating-point design, when evaluated in a commercial 45nm CMOS technology.

The successful adoption of extremely low-precision (binary) gradients in SignSGD [21] is appealing, as it might reduce both weight-update computation and data movement. However, directly applying the original SignSGD algorithm for training will not save energy, because it still computes the full-precision gradient before taking the signs. We propose a novel predictive sign gradient descent (PSG) algorithm, which predicts the sign of the gradients using low-cost bit-level predictors, thereby completely bypassing the costly full-gradient computation.

We next introduce how the weight gradients are updated in PSG. Assume the following notation: the full-precision and most-significant-bit (MSB; the MSB part is adopted as PSG's low-cost predictor) bit-widths of the input $x$ and the output gradient $g_y$ are denoted $(B_x, B_{g_y})$ and $(B_x^{msb}, B_{g_y}^{msb})$, respectively, and the corresponding input and output gradient used by PSG's predictors are denoted $x^{msb}$ and $g_y^{msb}$, respectively. As such, the quantization noise of the input and of the output gradient are $q_x = x - x^{msb}$ and $q_{g_y} = g_y - g_y^{msb}$, respectively. Similarly, after back-propagation, we denote the full-precision and low-precision (i.e., taking the MSBs) weight gradients as $g_w$ and $g_w^{msb}$, respectively, the latter of which is computed using $x^{msb}$ and $g_y^{msb}$. Then, with an empirically pre-selected threshold $\tau$, PSG updates the $i$-th weight gradient as follows:

$$\tilde{g}_w[i] = \begin{cases} \mathrm{sgn}\big(g_w^{msb}[i]\big), & \big|g_w^{msb}[i]\big| \ge \tau \\ \mathrm{sgn}\big(g_w[i]\big), & \text{otherwise} \end{cases} \qquad (2)$$

Note that in a hardware implementation, the computation to obtain $g_w^{msb}$ is embedded within that of $g_w$; therefore, PSG's predictors do not incur energy overhead.
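A sketch of the update rule of Eq. (2) in PyTorch is given below; `g_w_msb` stands for the low-cost predictor computed from the MSB portions of $x$ and $g_y$, `full_grad_fn` is a placeholder callback returning the full-precision gradient for the non-confident coordinates, and the use of the absolute maximum in the threshold follows our reading of the adaptive scheme described further below.

```python
import torch

def psg_gradient(g_w_msb, full_grad_fn, beta):
    """Predictive sign gradient, Eq. (2), with an adaptive threshold.

    g_w_msb      : low-precision weight-gradient predictor (from x_msb, g_y_msb)
    full_grad_fn : callback returning the full-precision gradient g_w; only the
                   coordinates below the threshold actually need it
    beta         : ratio in (0, 1); tau = beta * max |g_w_msb| (our assumption
                   for the adaptive threshold of Sec. 3.3)
    """
    tau = beta * g_w_msb.abs().max()
    confident = g_w_msb.abs() >= tau
    g_tilde = torch.sign(g_w_msb)        # cheap path: predicted signs
    if (~confident).any():
        g_full = full_grad_fn()          # expensive path, used sparingly
        g_tilde = torch.where(confident, g_tilde, torch.sign(g_full))
    return g_tilde
```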
PSG for energy-efficient training. Recent work [16] has shown that most of the training process is robust to reduced precision (e.g., 8 bits instead of 32 bits), except for the weight gradient calculations and updates. Following this finding, we similarly adopt a higher precision for the gradients than for the inputs and weights, i.e., $B_{g_y} > B_x = B_w$. Specifically, when training with PSG, we first compute the predictors using $B_x^{msb}$ (e.g., $B_x^{msb} = 4$) and $B_{g_y}^{msb}$ (e.g., $B_{g_y}^{msb} = 10$), and then update the weight gradients following Eq. (2). The further energy savings of training with PSG over fixed-point training [16] result from the fact that the predictors computed using $x^{msb}$ and $g_y^{msb}$ require exponentially less computational and data-movement energy.
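To make the predictor computation concrete, here is a small sketch of keeping only the top bits of a fixed-point tensor normalized to [-1, 1), in PyTorch; the floor-based truncation convention is our assumption, while the 4-bit and 10-bit widths are the ones quoted above.

```python
import torch

def msb_part(x, num_msb_bits):
    """Keep the `num_msb_bits` most significant bits of values in [-1, 1).

    The effective quantization step is Delta = 2^{-(num_msb_bits - 1)}, the
    step size that appears in the prediction-failure bound below.
    """
    delta = 2.0 ** -(num_msb_bits - 1)
    return torch.clamp(torch.floor(x / delta) * delta, -1.0, 1.0 - delta)

# e.g., x_msb = msb_part(x, 4) and g_y_msb = msb_part(g_y, 10), from which the
# low-cost predictor g_w_msb is computed instead of the full-precision g_w.
```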
Prediction guarantee of PSG. We analyze the probability of PSG's prediction failure to discuss its performance guarantee. Specifically, denoting the sign prediction failure produced by Eq. (2) as $H$, it can be proved that this probability is upper-bounded as follows:

$$P(H) \le \Delta_x E_1 + \Delta_{g_y} E_2, \qquad (3)$$

where $\Delta_x = 2^{-(B_x^{msb}-1)}$ and $\Delta_{g_y} = 2^{-(B_{g_y}^{msb}-1)}$ are the quantization step sizes of $x^{msb}$ and $g_y^{msb}$, respectively, and $E_1$ and $E_2$ are given in the Appendix along with the proof of Eq. (3), which inherits the spirit of [47, 62]. Eq. (3) shows that the prediction failure probability of PSG is upper-bounded by a term that decays exponentially with the precision assigned to the predictors, indicating that this failure probability can be very small if the predictors are designed properly.
Adaptive threshold. Training with PSG might lead to sign flips in the weight gradients as compared to those of the floating-point gradients; this occurs only when the latter have small magnitudes, so that the quantization noise of the predictors causes the sign flips. Therefore, it is important to properly select a threshold ($\tau$ in Eq. (2)) that optimally balances this sign-flip probability against the achieved energy savings. We adopt an adaptive threshold selection strategy because the dynamic range of the gradients differs significantly from layer to layer: instead of using a fixed number, we tune a ratio $\beta \in (0, 1)$, which yields the adaptive threshold $\tilde{\tau} = \beta \max_i \{g_w^{msb}[i]\}$.

4 Experiments

4.1 Experiment Setup

Datasets: We evaluate our proposed techniques on two datasets: CIFAR-10 and CIFAR-100. Common data augmentation methods (e.g., mirroring/shifting) are adopted, and data are normalized as in [63].

Models: Three popular backbones, ResNet-74, ResNet-110 [64], and MobileNetV2 [65], are considered. For evaluating each of the three proposed techniques (i.e., SMD, SLU, and PSG), we consider various experimental settings using ResNet-74 and the CIFAR-10 dataset for ablation studies, as described in Sections 4.2-4.5. ResNet-110 and MobileNetV2 results are reported in Section 4.6. Top-1 accuracies are measured for CIFAR-10, and both top-1 and top-5 accuracies for CIFAR-100.

Training settings: We adopt the training settings in [64] for the baseline default configuration. Specifically, we use SGD with a momentum of 0.9 and a weight decay factor of 0.0001, and the initialization introduced in [66]. Models are trained for 64k iterations. For experiments where PSG is used, the initial learning rate is adjusted to a smaller value, as SignSGD [21] suggests that small learning rates benefit convergence. For the others, the learning rate is initially set to 0.1 and then decayed by 10 at the 32k and 48k iterations, respectively. We also employ the stochastic weight averaging (SWA) technique [67], which we found to notably stabilize training, when PSG is adopted.

Real energy measurements using FPGA: The energy cost of CNN inference/training consists of both computational and data-movement costs, the latter of which is often dominant but cannot be captured by commonly used metrics such as the number of FLOPs [6]. We therefore evaluate the proposed techniques against the baselines in terms of accuracy and real measured energy consumption. Specifically, unless otherwise specified, all energy numbers and energy savings are obtained through real measurements by training the corresponding models and datasets on a state-of-the-art FPGA [68], a Digilent ZedBoard Zynq-7000 ARM/FPGA SoC development board. Fig. 2 shows our FPGA measurement setup, in which the FPGA board is connected to a laptop through a serial port and a power meter. In particular, the training settings are downloaded from the laptop to the FPGA board, the real measured energy consumption of the whole training process is obtained via the power meter, and the result is then sent back to the laptop. All energy results are measured from the FPGA.
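For reference, a sketch of the baseline optimization schedule described above, assuming PyTorch; the model is a placeholder and the scheduler is stepped once per training iteration.

```python
import torch

def make_baseline_optimizer(model):
    """SGD, momentum 0.9, weight decay 1e-4; lr 0.1 decayed by 10x at the
    32k-th and 48k-th iterations of a 64k-iteration run (baseline setting)."""
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1,
                                momentum=0.9, weight_decay=1e-4)
    scheduler = torch.optim.lr_scheduler.MultiStepLR(
        optimizer, milestones=[32000, 48000], gamma=0.1)
    return optimizer, scheduler
```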
Figure 2:
The energy measurement setup with (from left to right) a Mac Air laptop, a Xilinx FPGA board [68], and a power meter.
4.2 Evaluation of SMD

We first validate the energy savings achieved by SMD against a few "off-the-shelf" options: (1) can we train with the standard algorithm, using fewer iterations and otherwise the same training protocol? (2) can we train with the standard algorithm, using fewer iterations but properly increased learning rates? Two sets of carefully designed experiments are presented below to address these questions.
Training with SMD vs. standard mini-batch (SMB):
We first evaluate SMD against standard mini-batch (SMB) training, which uses all (vs. 50% in SMD) mini-batch samples. As shown in Fig. 3a, when the energy ratio is 1 (i.e., training with SMB + 64k iterations vs. SMD + 128k iterations), the proposed SMD technique boosts the inference accuracy by 0.39% over the standard way. We next "naively" suppress the energy cost of SMB by reducing the number of training iterations. Specifically, we reduce the SMB training iterations to a set of fractions of the original schedule, scaling the learning rate schedule (e.g., when to reduce learning rates) proportionally with the total iteration number. For comparison, we conduct experiments of training with SMD where the number of equivalent training iterations is the same as in the SMB cases. Fig. 3a shows that training with SMD consistently achieves higher inference accuracy than SMB, with the margin ranging from 0.39% to 0.86%. Furthermore, training with SMD reduces the training energy cost by about one third while still boosting the inference accuracy (compare SMD at an energy ratio of 0.67 with SMB at an energy ratio of 1 in Fig. 3a). We adopt SMD at this energy ratio of 0.67 in all the remaining experiments.

Figure 3: The top-1 accuracy on the CIFAR-10 test set using the ResNet-74 model when: (a) training with SMD versus the standard mini-batch (SMB) method, with training energy ratios ranging from 0.5 to 1 (marker sizes are drawn proportionally to the measured training energy cost); and (b) training with SMD, and SMB with different increased learning rates, all under the same training energy budget.
Figure 4: Inference accuracy vs. energy ratio, where the energy ratios are obtained by normalizing the corresponding energy over that of the original setting (SMB + 64k iterations).
Table 1: The accuracy of SMD on other datasets and backbones (at the energy ratio adopted above).

Dataset     Backbone     SMB Accuracy   SMD Accuracy
CIFAR-10    ResNet-110   92.75%
CIFAR-100   ResNet-74    71.11%
We repeated training ResNet-74 on CIFAR-10 using SMD for 10 runs with different random initializations. The standard deviation of the accuracy is only 0.132%, showing high stability. We also conducted more experiments with different backbones and datasets. As shown in Tab. 1, SMD is consistently better than SMB.
Training with SMD vs. SMB + increased learning rates:
We further compare with SMB using tuned/larger learning rates, conjecturing that larger rates might accelerate convergence by reducing the number of training epochs needed. The results are summarized in Fig. 3b. Specifically, when the number of iterations is reduced, we perform a grid search over the learning rate. All compared methods are given the same training energy budget. Fig. 3b demonstrates that while increasing the learning rate does improve SMB's energy efficiency over sticking to the original protocol, our proposed SMD still maintains a clear advantage.

4.3 Evaluation of SLU

Our current SLU experiments are based on CNNs with residual connections, partially because they dominate state-of-the-art CNNs. We will extend SLU to other model structures in future work. We evaluate the proposed SLU by comparing it with stochastic depth (SD) [69], a technique originally developed for training very deep networks effectively by updating only a random subset of layers at each mini-batch. It can be viewed as a "random" version of SLU (which uses learned layer selection). We follow all suggested settings in [69]. For a fair comparison, we adjust the hyper-parameter $p_L$ [69] so that the SD dropping ratio is always the same as SLU's.

From Fig. 4, training with SLU consistently achieves higher inference accuracy than SD when their training energy costs are the same. It is further encouraging to observe that training with SLU can sometimes even achieve higher accuracy in addition to saving energy. For example, comparing training with SLU at an energy ratio of 0.3 against SD at an energy ratio of 0.5, the proposed SLU technique reduces the training energy cost while also boosting the inference accuracy. These results endorse the use of data-driven gates instead of random dropping in the context of energy-efficient training. Training with SLU + SMD combined further boosts the accuracy while reducing the energy cost. Furthermore, 20 trials of SLU experiments with ResNet-38 on CIFAR-10 conclude, with a 95% confidence level, that the confidence intervals for the mean top-1 accuracy and the energy savings are [92.47%, 92.58%] (baseline: 92.50%) and [39.55%, 40.52%], respectively, verifying SLU's effectiveness.

4.4 Evaluation of PSG

We evaluate PSG against two alternatives: (1) the 8-bit fixed-point training proposed in [16]; and (2) the original SignSGD [21]. For all experiments in Sections 4.4 and 4.5, we adopt 8-bit precision for the activations/weights and 16-bit precision for the gradients. The corresponding precisions of the predictors are 4-bit and 10-bit, respectively. We use an adaptive threshold (see Section 3.3). More experiment details are in the Appendix.
Table 2: Comparing the inference accuracy and achieved energy savings (over 32-bit floating-point training) when training with SGD, 8-bit fixed point [16], SignSGD, and PSG, using ResNet-74 and CIFAR-10.

Method           32-bit SGD   8-bit [16]   SignSGD [21]   PSG
Accuracy         93.52%       93.24%       92.54%         92.59%
Energy savings   -            38.62%       -              63.28%
Table 3: The inference accuracy and energy savings (over 32-bit floating-point training) of the proposed E²-Train under different (averaged) SLU skipping ratios and adaptive thresholds (i.e., β in Section 3.3), using ResNet-74 and CIFAR-10.

Skipping ratio              20%      40%      60%
Accuracy (first β setting)  92.12%   91.84%   91.36%
Accuracy (second β setting) 92.15%   91.72%   90.94%
Computational savings       80.27%   85.20%   90.13%
Energy savings              84.64%   88.72%   92.81%

As shown in Table 2, the 8-bit fixed-point training in [16] saves about 38.62% of the training energy (going from 32-bit to 8-bit in general yields larger savings, which are compromised here by its 32-bit gradients) with a marginal accuracy loss of 0.28% as compared to 32-bit SGD. The proposed PSG almost doubles the training energy savings (63.28% vs. 38.62% for [16]) with a still small accuracy loss of 0.93% (vs. 0.28% for [16]). Interestingly, compared to SignSGD [21], PSG slightly boosts the inference accuracy (92.59% vs. 92.54%) while also saving energy, i.e., better training energy efficiency with a slightly better inference accuracy. Besides, as we observed, the ratio of weight gradients whose signs are taken from $g_w^{msb}$ typically remains at least 60% throughout the training process under the adaptive threshold.
Figure 5:
Inference accuracy vs. energy cost at different stages of training (i.e., the empirical convergence curves), when training with SMB, SD, SLU only, SLU + SMD, and E²-Train, on CIFAR-10 with ResNet-74.

4.5 E²-Train: Combining SMD, SLU, and PSG

We now evaluate the proposed E²-Train framework, which combines the SMD, SLU, and PSG techniques. As shown in Table 3, E²-Train: (1) can indeed further boost performance compared to training with SMD+SLU alone at a comparable energy budget (see Fig. 4 at the energy ratio of 0.2); and (2) can achieve extremely aggressive energy savings of >90% and >60% while incurring a top-1 accuracy loss of only about 2% and 1.2%, respectively, as compared to 32-bit floating-point SGD (see Table 2), i.e., up to roughly 10× better training energy efficiency with a small accuracy loss.
Impact on empirical convergence speed. We plot the training convergence curves of the different methods in Fig. 5, with the x-axis expressed in the alternative form of training energy cost (up to the current iteration). We observe that E²-Train does not slow down the empirical convergence. In fact, it even makes the training loss decrease faster in the early stage.
Experiments on adapting a pre-trained model. We perform a proof-of-concept experiment on CNN fine-tuning by splitting the CIFAR-10 training set in half, where each class is split evenly and i.i.d. We first pre-train ResNet-74 on the first half, then fine-tune it on the second half. During fine-tuning, we compare two energy-efficient options: (1) fine-tuning only the last FC layer using standard training; (2) fine-tuning all layers using E²-Train. With all hyperparameters tuned to our best efforts, the two fine-tuning methods improve the pre-trained model's top-1 accuracy by 0.30% and 1.37%, respectively, while (2) saves 61.58% more energy (FPGA-measured) than (1). This shows that E²-Train is the preferred option, leading to both higher accuracy and more energy savings.

4.6 Results with More Models and Datasets

Table 4 evaluates E²-Train and its ablation baselines on more models and datasets. The conclusions align with the ResNet-74 cases. Remarkably, on CIFAR-10 with ResNet-110, E²-Train saves over 83% energy with only 0.56% accuracy loss. When saving over 91% of the energy (i.e., more than 10×), the accuracy drop is still less than 2%. On CIFAR-100 with ResNet-110, E²-Train can even surpass the baseline on both top-1 and top-5 accuracy while saving over 84% energy. More notably, E²-Train is also effective for compact networks: it saves about 90% of the energy cost while achieving comparable accuracy when adopted for training MobileNetV2.

Table 4: Experiment results with ResNet-110 and MobileNetV2 on CIFAR-10/CIFAR-100.
Dataset     Method                   Backbone           Computational Savings (FLOPs)   Energy Savings (FPGA-measured)   Accuracy (top-1)   Accuracy (top-5)
CIFAR-10    SMB (original)           ResNet-110         -         -         93.57%   -
            SD [69]                  ResNet-110         50%       46.03%    91.51%   -
            SMB (original)           MobileNetV2 [70]   -         -         92.47%   -
            E²-Train (SMD+SLU+PSG)   ResNet-110         80.27%    83.40%    93.01%   -
            E²-Train (SMD+SLU+PSG)   ResNet-110         85.20%    87.42%    91.74%   -
            E²-Train (SMD+SLU+PSG)   ResNet-110         90.13%    91.34%    91.68%   -
            E²-Train (SMD+SLU+PSG)   MobileNetV2 [70]   75.34%    88.73%    92.06%   -
CIFAR-100   SMB (original)           ResNet-110         -         -         71.60%   91.50%
            SD [69]                  ResNet-110         50%       48.34%    70.40%   92.58%
            SMB (original)           MobileNetV2 [70]   -         -         71.91%   -
            E²-Train (SMD+SLU+PSG)   ResNet-110         80.27%    84.17%    71.63%   91.72%
            E²-Train (SMD+SLU+PSG)   ResNet-110         85.20%    88.72%    68.61%   89.84%
            E²-Train (SMD+SLU+PSG)   ResNet-110         90.13%    92.90%    67.94%   89.06%
            E²-Train (SMD+SLU+PSG)   MobileNetV2 [70]   75.34%    88.17%    71.61%   -

5 Conclusions and Future Work

We propose the E²-Train framework to achieve energy-efficient CNN training in resource-constrained settings. Three complementary efforts to trim down training costs, at the data, model, and algorithm levels, respectively, are carefully designed, justified, and integrated. Experiments in both simulation and on a real FPGA demonstrate the promise of E²-Train. Despite this preliminary success, we are aware of several limitations of E²-Train, which also point us to a future road map. For example, E²-Train is currently designed and evaluated for standard off-line CNN training, with all training data presented in batch, for simplicity. This is not scalable to many real-world IoT scenarios, where new training data arrives sequentially in a stream, with limited or no data buffer/storage, leading to the open challenge of "on-the-fly" CNN training [71]. In that case, while both SLU and PSG are still applicable, SMD needs to be modified, e.g., by one-pass active selection of streamed-in data samples. Besides, SLU is not yet straightforward to extend to plain CNNs without residual connections. We expect finer-grained selective model updates, such as online channel pruning [47], to be useful alternatives there. We also plan to optimize E²-Train for continuous adaptation or lifelong learning.

Acknowledgments
This work is in part supported by the NSF RTML grants (1937592, 1937588). The authors would like to thank all anonymous reviewers for their tremendously useful comments that helped improve our work.
References

[1] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. In F. Pereira, C. J. C. Burges, L. Bottou, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 25, pages 1097–1105. Curran Associates, Inc., 2012.
2] Ross B. Girshick, Jeff Donahue, Trevor Darrell, and Jitendra Malik. Rich feature hierarchies for accurateobject detection and semantic segmentation. In Proceedings of the IEEE conference on computer visionand pattern recognition, volume abs/1311.2524, pages 580–587, 2014.[3] Yaniv Taigman, Ming Yang, Marc’Aurelio Ranzato, and Lior Wolf. Deepface: Closing the gap to human-level performance in face verification. In Proceedings of the IEEE conference on computer vision andpattern recognition, pages 1701–1708, 2014.[4] Tien-Ju Yang, Yu-Hsin Chen, and Vivienne Sze. Designing energy-efficient convolutional neural net-works using energy-aware pruning. 2017 IEEE Conference on Computer Vision and Pattern Recognition(CVPR), Jul 2017.[5] Lukasz Kaiser, Aidan N. Gomez, Noam Shazeer, Ashish Vaswani, Niki Parmar, Llion Jones, and JakobUszkoreit. One model to learn them all. CoRR, abs/1706.05137, 2017.[6] Y. H. Chen, T. Krishna, J. S. Emer, and V. Sze. Eyeriss: An energy-efficient reconfigurable accelerator fordeep convolutional neural networks. IEEE Journal of Solid-State Circuits, 52(1):127–138, Jan 2017.[7] Y. Lin, S. Zhang, and N. R. Shanbhag. Variation-tolerant architectures for convolutional neural networks inthe near threshold voltage regime. In 2016 IEEE International Workshop on Signal Processing Systems(SiPS), pages 17–22, Oct 2016.[8] D. Bankman, L. Yang, B. Moons, M. Verhelst, and B. Murmann. An always-on 3.8 µ j/86processor withall memory on chip in 28-nm cmos. IEEE Journal of Solid-State Circuits, 54(1):158–172, Jan 2019.[9] Priya Goyal, Piotr Dollár, Ross B. Girshick, Pieter Noordhuis, Lukasz Wesolowski, Aapo Kyrola, AndrewTulloch, Yangqing Jia, and Kaiming He. Accurate, large minibatch SGD: training imagenet in 1 hour.CoRR, abs/1706.02677, 2017.[10] Minsik Cho, Ulrich Finkler, Sameer Kumar, David S. Kung, Vaibhav Saxena, and Dheeraj Sreedhar.Powerai ddl. CoRR, abs/1708.02188, 2017.[11] Yang You, Zhao Zhang, Cho-Jui Hsieh, James Demmel, and Kurt Keutzer. Imagenet training in minutes.Proceedings of the 47th International Conference on Parallel Processing - ICPP 2018, 2018.[12] Takuya Akiba, Shuji Suzuki, and Keisuke Fukuda. Extremely large minibatch sgd: Training resnet-50 onimagenet in 15 minutes. CoRR, abs/1711.04325, 2017.[13] Xianyan Jia, Shutao Song, Wei He, Yangzihao Wang, Haidong Rong, Feihu Zhou, Liqiang Xie, ZhenyuGuo, Yuanzhou Yang, Liwei Yu, et al. Highly scalable deep learning training system with mixed-precision:Training imagenet in four minutes. arXiv preprint arXiv:1807.11205, 2018.[14] Suyog Gupta, Ankur Agrawal, Kailash Gopalakrishnan, and Pritish Narayanan. Deep learning with limitednumerical precision. In International Conference on Machine Learning, pages 1737–1746, 2015.[15] Naigang Wang, Jungwook Choi, Daniel Brand, Chia-Yu Chen, and Kailash Gopalakrishnan. Training deepneural networks with 8-bit floating point numbers. In Advances in neural information processing systems,pages 7675–7684, 2018.[16] Ron Banner, Itay Hubara, Elad Hoffer, and Daniel Soudry. Scalable methods for 8-bit training of neuralnetworks. In Advances in Neural Information Processing Systems, pages 5145–5153, 2018.[17] Ting-Wu Chin, Ruizhou Ding, and Diana Marculescu. Adascale: Towards real-time video object detectionusing adaptive scaling. arXiv preprint arXiv:1902.02910, 2019.[18] Shuang Wu, Guoqi Li, Lei Deng, Liu Liu, Dong Wu, Yuan Xie, and Luping Shi. L1-norm batchnormalization for efficient training of deep neural networks. 
IEEE transactions on neural networks andlearning systems, 2018.[19] Elad Hoffer, Ron Banner, Itay Golan, and Daniel Soudry. Norm matters: efficient and accurate normaliza-tion schemes in deep networks. In Advances in Neural Information Processing Systems, pages 2160–2170,2018.[20] Xin Wang, Fisher Yu, Zi-Yi Dou, Trevor Darrell, and Joseph E Gonzalez. Skipnet: Learning dynamicrouting in convolutional networks. In Proceedings of the European Conference on Computer Vision(ECCV), pages 409–424, 2018.[21] Jeremy Bernstein, Yu-Xiang Wang, Kamyar Azizzadenesheli, and Animashree Anandkumar. signSGD:Compressed Optimisation for Non-Convex Problems. In International Conference on Machine Learning(ICML-18), 2018.
22] Pratik Chaudhari and Stefano Soatto. Stochastic gradient descent performs variational inference, convergesto limit cycles for deep networks. In 2018 Information Theory and Applications Workshop (ITA), pages1–10. IEEE, 2018.[23] Angelos Katharopoulos and François Fleuret. Not all samples are created equal: Deep learning withimportance sampling, 2018.[24] Chiyuan Zhang, Samy Bengio, and Yoram Singer. Are all layers created equal? arXiv preprintarXiv:1902.01996, 2019.[25] Jonathan Frankle and Michael Carbin. The lottery ticket hypothesis: Finding sparse, trainable neuralnetworks. In ICLR 2019, 2019.[26] Frank Seide, Hao Fu, Jasha Droppo, Gang Li, and Dong Yu. 1-bit stochastic gradient descent and itsapplication to data-parallel distributed training of speech dnns. In Fifteenth Annual Conference of theInternational Speech Communication Association, 2014.[27] Dan Alistarh, Jerry Li, Ryota Tomioka, and Milan Vojnovic. Qsgd: Randomized quantization forcommunication-optimal stochastic gradient descent. arXiv preprint arXiv:1610.02132, 2016.[28] Hantian Zhang, Jerry Li, Kaan Kara, Dan Alistarh, Ji Liu, and Ce Zhang. Zipml: Training linear modelswith end-to-end low precision, and a little bit of deep learning. In Proceedings of the 34th InternationalConference on Machine Learning-Volume 70, pages 4035–4043. JMLR. org, 2017.[29] Christopher De Sa, Matthew Feldman, Christopher Ré, and Kunle Olukotun. Understanding and optimizingasynchronous low-precision stochastic gradient descent. In ACM SIGARCH Computer Architecture News,volume 45, pages 561–574. ACM, 2017.[30] Wei Wen, Cong Xu, Feng Yan, Chunpeng Wu, Yandan Wang, Yiran Chen, and Hai Li. Terngrad: Ternarygradients to reduce communication in distributed deep learning. In I. Guyon, U. V. Luxburg, S. Ben-gio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural InformationProcessing Systems 30, pages 1509–1519. Curran Associates, Inc., 2017.[31] Alham Fikri Aji and Kenneth Heafield. Sparse communication for distributed gradient descent. arXivpreprint arXiv:1704.05021, 2017.[32] Yujun Lin, Song Han, Huizi Mao, Yu Wang, and William J Dally. Deep gradient compression: Reducingthe communication bandwidth for distributed training. arXiv preprint arXiv:1712.01887, 2017.[33] Jianqiao Wangni, Jialei Wang, Ji Liu, and Tong Zhang. Gradient sparsification for communication-efficientdistributed optimization. In Advances in Neural Information Processing Systems, pages 1299–1309, 2018.[34] Song Han, Huizi Mao, and William J Dally. Deep compression: Compressing deep neural networkswith pruning, trained quantization and huffman coding. In International Conference on LearningRepresentations, 2016.[35] Yihui He, Xiangyu Zhang, and Jian Sun. Channel pruning for accelerating very deep neural networks. In2017 IEEE International Conference on Computer Vision (ICCV), pages 1398–1406. IEEE, 2017.[36] Jiecao Yu, Andrew Lukefahr, David Palframan, Ganesh Dasika, Reetuparna Das, and Scott Mahlke.Scalpel: Customizing dnn pruning to the underlying hardware parallelism. In ACM SIGARCH ComputerArchitecture News, volume 45, pages 548–560. ACM, 2017.[37] Junru Wu, Yue Wang, Zhenyu Wu, Zhangyang Wang, Ashok Veeraraghavan, and Yingyan Lin. Deepk-means: Re-training and parameter sharing with harder cluster assignments for compressing deep convo-lutions. 
In Jennifer Dy and Andreas Krause, editors, Proceedings of the 35th International Conference onMachine Learning, volume 80 of Proceedings of Machine Learning Research, pages 5363–5372, Stock-holmsmässan, Stockholm Sweden, 10–15 Jul 2018. PMLR.[38] Wuyang Chen, Ziyu Jiang, Zhangyang Wang, Kexin Cui, and Xiaoning Qian. Collaborative global-localnetworks for memory-efficient segmentation of ultra-high resolution images. In Proceedings of the IEEEConference on Computer Vision and Pattern Recognition, pages 8924–8933, 2019.[39] Zuxuan Wu, Tushar Nagarajan, Abhishek Kumar, Steven Rennie, Larry S. Davis, Kristen Grauman, andRogerio Feris. Blockdrop: Dynamic inference paths in residual networks. 2018 IEEE/CVF Conference onComputer Vision and Pattern Recognition, Jun 2018.[40] Andreas Veit and Serge Belongie. Convolutional networks with adaptive inference graphs. Lecture Notesin Computer Science, page 3–18, 2018.
41] Ji Lin, Yongming Rao, Jiwen Lu, and Jie Zhou. Runtime neural pruning. In I. Guyon, U. V. Luxburg, S. Ben-gio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural InformationProcessing Systems 30, pages 2181–2191. Curran Associates, Inc., 2017.[42] Zhourong Chen, Yang Li, Samy Bengio, and Si Si. Gaternet: Dynamic filter selection in convolutionalneural network via a dedicated global gating network, 2018.[43] Jianghao Shen, Yue Wang, Pengfei Xu, Yonggan Fu, Zhangyang Wang, and Yingyan Lin. Fractionalskipping: Toward finer-grained dynamic cnn inference. In Thirty-Fourth AAAI Conference on ArtificialIntelligence (accepted), 2020.[44] Yue Wang, Tan Nguyen, Yang Zhao, Zhangyang Wang, Yingyan Lin, and Richard Baraniuk. Energynet:Energy-efficient dynamic inference. NeurIPS workshop, 2018.[45] Yue Wang, Jianghao Shen, Ting-Kuei Hu, Pengfei Xu, Tan Nguyen, Richard Baraniuk, Zhangyang Wang,and Yingyan Lin. Dual dynamic inference: Enabling more efficient, adaptive and controllable deepinference. arXiv preprint arXiv:1907.04523, 2019.[46] Yingyan Lin, Charbel Sakr, Yongjune Kim, and Naresh Shanbhag. Predictivenet: An energy-efficientconvolutional neural network via zero prediction. In Proceedings of ISCAS, 2017.[47] Sangkug Lym, Esha Choukse, Siavash Zangeneh, Wei Wen, Sujay Sanghavi, and Mattan Erez. Prunetrain:Fast neural network training by dynamic sparse model reconfiguration, 2019.[48] Dimitri P Bertsekas. Incremental gradient, subgradient, and proximal methods for convex optimization: Asurvey. Optimization for Machine Learning, 2010(1-38):3, 2011.[49] Ohad Shamir. Without-replacement sampling for stochastic gradient methods. In Advances in neuralinformation processing systems, pages 46–54, 2016.[50] Yoshua Bengio. Practical recommendations for gradient-based training of deep architectures. In Neuralnetworks: Tricks of the trade, pages 437–478. Springer, 2012.[51] Benjamin Recht and Christopher Re. Beneath the valley of the noncommutative arithmetic-geometric meaninequality: conjectures, case-studies. Technical report, and consequences. Technical report, University ofWisconsin-Madison, 2012.[52] Mert Gürbüzbalaban, Asu Ozdaglar, and Pablo Parrilo. Why random reshuffling beats stochastic gradientdescent. arXiv preprint arXiv:1510.08560, 2015.[53] Rong Ge, Furong Huang, Chi Jin, and Yang Yuan. Escaping from saddle points—online stochastic gradientfor tensor decomposition. In Conference on Learning Theory, pages 797–842, 2015.[54] Nitish Shirish Keskar, Dheevatsa Mudigere, Jorge Nocedal, Mikhail Smelyanskiy, and Ping Tak PeterTang. On large-batch training for deep learning: Generalization gap and sharp minima. arXiv preprintarXiv:1609.04836, 2016.[55] Hadi Daneshmand, Jonas Kohler, Aurelien Lucchi, and Thomas Hofmann. Escaping saddles with stochasticgradients. arXiv preprint arXiv:1803.05999, 2018.[56] Tyler B Johnson and Carlos Guestrin. Training deep models faster with robust, approximate importancesampling. In Advances in Neural Information Processing Systems, pages 7265–7275, 2018.[57] Cody Coleman, Stephen Mussmann, Baharan Mirzasoleiman, Peter Bailis, Percy Liang, Jure Leskovec,and Matei Zaharia. Select via proxy: Efficient data selection for training deep networks, 2019.[58] Chiyuan Zhang, Samy Bengio, and Yoram Singer. Are all layers created equal? CoRR, abs/1902.01996,2019.[59] Andreas Veit, Michael Wilber, and Serge Belongie. 
Residual networks behave like ensembles of relatively shallow networks, 2016.
[60] Klaus Greff, Rupesh K Srivastava, and Jürgen Schmidhuber. Highway and residual networks learn unrolled iterative estimation. ICLR, 2017.
[62] Charbel Sakr, Yongjune Kim, and Naresh Shanbhag. Analytical guarantees on numerical precision of deep neural networks. In Doina Precup and Yee Whye Teh, editors, Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pages 3007–3016, International Convention Centre, Sydney, Australia, 06–11 Aug 2017. PMLR.
[63] Alex Krizhevsky and Geoffrey Hinton. Learning multiple layers of features from tiny images. Technical report, Citeseer, 2009.
[64] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
[65] Mark Sandler, Andrew Howard, Menglong Zhu, Andrey Zhmoginov, and Liang-Chieh Chen. Mobilenetv2: Inverted residuals and linear bottlenecks, 2018.
[66] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Delving deep into rectifiers: Surpassing human-level performance on imagenet classification. In Proceedings of the IEEE international conference on computer vision, pages 1026–1034, 2015.
[67] Guandao Yang, Tianyi Zhang, Polina Kirichenko, Junwen Bai, Andrew Gordon Wilson, and Christopher De Sa. Swalp: Stochastic weight averaging in low-precision training. arXiv preprint arXiv:1904.11943, 2019.
[68] Xilinx Inc. Digilent ZedBoard Zynq-7000 ARM/FPGA SoC Development Board, 2019. [Online; accessed 20-May-2019].
[69] Gao Huang, Yu Sun, Zhuang Liu, Daniel Sedra, and Kilian Q Weinberger. Deep networks with stochastic depth. In European conference on computer vision, pages 646–661. Springer, 2016.
[70] Mark Sandler, Andrew Howard, Menglong Zhu, Andrey Zhmoginov, and Liang-Chieh Chen. Mobilenetv2: Inverted residuals and linear bottlenecks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4510–4520, 2018.
[71] Doyen Sahoo, Quang Pham, Jing Lu, and Steven CH Hoi. Online deep learning: Learning deep neural networks on the fly. arXiv preprint arXiv:1711.03705, 2017.