A Bi-Directional Co-Design Approach to Enable Deep Learning on IoT Devices
Xiaofan Zhang, Cong Hao, Yuhong Li, Yao Chen, Jinjun Xiong, Wen-mei Hwu, Deming Chen
Abstract
Developing deep learning models for resource-constrained Internet-of-Things (IoT) devices is challenging, as it is difficult to achieve both good quality of results (QoR), such as DNN model inference accuracy, and quality of service (QoS), such as inference latency, throughput, and power consumption. Existing approaches typically separate the DNN model development step from its deployment on IoT devices, resulting in suboptimal solutions. In this paper, we first introduce a few interesting but counterintuitive observations about such a separate design approach, and empirically show why it may lead to suboptimal designs. Motivated by these observations, we then propose a novel and practical bi-directional co-design approach: a bottom-up DNN model design strategy together with a top-down flow for DNN accelerator design. It enables a joint optimization of both DNN models and their deployment configurations on IoT devices, represented here by FPGAs. We demonstrate the effectiveness of the proposed co-design approach on a real-life object detection application using the Pynq-Z1 embedded FPGA. Our method obtains state-of-the-art results on both QoR, with high accuracy (IoU), and QoS, with high throughput (FPS) and high energy efficiency.
1. Introduction
To enable deep learning capability on IoT devices, there are two major components to be designed: the software, e.g., DNN models whose parameters are learned for specific applications, and the hardware, such as DNN accelerators running on GPUs, FPGAs, or ASICs. Both of them contribute to the overall QoR and QoS without clear distinctions, so there is an urgent need for DNN and accelerator co-design.
Typically, DNNs and their accelerators are designed and optimized separately for IoT applications in an iterative manner. DNNs are first designed with more concentration on QoR. Such DNNs can be excessively complicated for the targeted IoT devices; they must be compressed using quantization, network pruning, or sparsification (Wang et al., 2018; Han et al., 2017) before being implemented on hardware, and then be retrained to maintain inference accuracy. Since no hardware constraints are captured during DNN design, this design methodology can only expect hardware accelerators to deliver good QoS through later optimizations on the hardware side. On the other hand, DNN accelerator designs usually adopt a consistent overall architecture (such as the recurrent (Aydonat et al., 2017; Zeng et al., 2018; Jouppi et al., 2017) or pipelined structure (Li et al., 2016; Zhang et al., 2018)) but various scale-down factors to meet different hardware constraints. When facing strict hardware constraints, scaling down the accelerator is not always feasible, as the shrinking resources can significantly slow down the DNN inference process and result in poor QoS. Design opportunities must then turn to the algorithm side and ask for more compact DNN models.
2. Empirical Observations
One of the most fundamental barriers blocking joint DNN and accelerator design is the different sensitivities of DNN/accelerator configurations (e.g., DNN model size, hardware utilization features). It is hard to balance these configurations using a separated DNN/accelerator design approach, since a negligible change in DNN models may cause huge differences in hardware accelerators and vice versa, resulting in difficult trade-offs between QoR and QoS.
Observation 1: similar compression rate but different accuracy.
When designing DNNs for IoT applications, it is inevitable to perform model compression. Although the overall QoS may be the same for DNNs with similar compression rates, compressing different DNN components may cause great differences in QoR. As shown in Fig. 1 (a), the accuracy trends vary significantly between quantizing parameters and quantizing intermediate feature maps (FMs). In this figure, the coordinates of each bubble center represent accuracy and model compression rate, while the bubble area shows data size in megabytes (MB); we scale up the bubble sizes of FMs for better graphic effect. By compressing the model from full precision (float32) to 8-bit and 4-bit fixed-point, ternary, and binary representations, we reduce the parameter size by 22X (from 237.9 MB). Accuracy is much less sensitive to compressing the parameters (a 4.8% accuracy drop with 22X compression) than to compressing the FMs. Challenges also come from the difficulty of DNN training. As shown in Fig. 1 (b), the accuracy growth of the compressed model is quite unstable compared to the original full-precision model. It requires more effort to design the training process (e.g., fine-tuning the training set-up or iteratively modifying the DNN compression rate) and more powerful machines (e.g., computer clusters for faster training (Li et al., 2018)).
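To make the parameter compression in Observation 1 concrete, the following minimal sketch (our own illustration, not the authors' code; the rounding scheme and bit allocation are assumptions) quantizes float32 weights to k-bit fixed point:

```python
import numpy as np

def quantize_fixed_point(w: np.ndarray, bits: int, frac_bits: int) -> np.ndarray:
    """Uniformly quantize float weights to signed fixed point with
    `bits` total bits and `frac_bits` fractional bits (round-to-nearest)."""
    scale = 2.0 ** frac_bits
    qmin = -(2 ** (bits - 1))      # most negative representable code
    qmax = 2 ** (bits - 1) - 1     # most positive representable code
    codes = np.clip(np.round(w * scale), qmin, qmax)
    return codes / scale           # de-quantized values, for accuracy simulation

# Example: 4-bit weights shrink storage by 8X versus float32.
w = np.random.randn(64, 64).astype(np.float32) * 0.1
w_q = quantize_fixed_point(w, bits=4, frac_bits=3)
print("mean abs quantization error:", np.abs(w - w_q).mean())
```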
Observation 2: similar accuracy but different hardware resource utilization.

DNN models with similar QoR may also result in greatly different QoS because of their different hardware resource usage. Taking the implementation of a DNN accelerator on an FPGA as an example, a single-bit difference in data representation may have a considerable impact on hardware resource utilization. Fig. 2 (a) shows the BRAM (on-chip memory in FPGA) usage of the same accelerator architecture under different image resize factors and 12~16-bit quantizations, while Fig. 2 (b) shows how DSP utilization changes with the quantizations chosen for weights and FMs (see also the buffer-cost sketch at the end of this section).

Observation 3: larger model size but no higher QoR upper-bound.

When deploying DNNs on IoT devices, it is common to first find a DNN with the desired QoR upper-bound for the targeted application, and then to prune the DNN to make up for the lost QoS on hardware. This solution assumes that complicated DNNs with more parameters always deliver higher QoR than simple DNNs with fewer parameters. However, this is not always true. By examining a UAV-based object detection task (DAC, 2018), we observe an abnormal trend regarding model size and QoR upper-bound (Table 1), where DNNs with more parameters fail to deliver higher accuracy after adequate training. This implies that the current separated DNN/accelerator design may only reach suboptimal solutions, and requires more time and effort of iterative refining before delivering perfect QoR and QoS.
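The buffer-cost sketch referenced under Observation 2 is given below. It is our back-of-the-envelope illustration, assuming the 18Kb BRAM blocks of Zynq-7000 devices (such as the Pynq-Z1) and an arbitrary feature-map tile size, of how one extra bit of precision changes the BRAM demand:

```python
import math

BRAM18K_BITS = 18 * 1024  # one 18Kb block RAM on Zynq-7000 devices

def fm_buffer_brams(height: int, width: int, channels: int, bits: int) -> int:
    """BRAM-18K blocks needed to hold one on-chip feature-map buffer,
    ignoring port-width packing constraints for simplicity."""
    total_bits = height * width * channels * bits
    return math.ceil(total_bits / BRAM18K_BITS)

# Illustrative tile: a single extra bit per activation costs several blocks.
for bits in (12, 13, 16):
    print(bits, "bits ->", fm_buffer_brams(20, 160, 32, bits), "BRAM-18K blocks")
```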
3. The Proposed Bi-Directional Co-Design
Figure 1. (a) Accuracy trends of AlexNet inference on the ImageNet dataset during parameter (blue) and feature map (green) compression with retraining. The model name is denoted as precision p_FM for FMs, p_1 for the 1st CONV, p_2 for the 2nd~5th CONVs, p_3 for the 1st~2nd FCs, and p_4 for the 3rd FC, in p_FM-p_1p_2p_3p_4 format; (b) Training of ResNet-20 on the Cifar10 dataset using ADMM with full-precision (blue) and quantized (green) FMs and parameters.

Figure 2. (a) BRAM usages of accelerators with the same architecture but 12~16-bit quantizations (W12FM12 ~ W16FM16) and different image resize factors. (b) DSP utilization of accelerators using different quantizations between weights (W) and feature maps (FMs), with the numbers indicating the bits allocated.

Motivated by the discussed observations, we propose a bi-directional co-design methodology with a bottom-up hardware-oriented DNN design and a top-down accelerator design considering DNN-specific characteristics. Both DNNs and accelerators are designed simultaneously to pursue the best trade-off between QoS and QoR. The overall flow of the proposed co-design is shown in Fig. 3. The inputs of this flow include the targeted QoS, QoR, and the hardware resource constraints; the outputs include the generated DNN model and its corresponding accelerator design. We break down the whole flow into three steps:
Step 1: Bundle construction and QoS evaluation.
We randomly select DNN components from the layer pool and construct bundles (as basic building blocks of generated DNNs) with different layer combinations. Each bundle is evaluated by analytical models to capture its hardware characteristics (e.g., latency, computation and memory demands, resource utilization), which allows QoS estimation at an early stage of DNN exploration.
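The analytical models in Step 1 can be as simple as a roofline-style bound per layer. The following sketch is our illustration of such an estimator; the peak compute and bandwidth figures are assumed placeholders for an embedded-FPGA-class device, not measured Pynq-Z1 numbers:

```python
from dataclasses import dataclass

@dataclass
class Layer:
    macs: int         # multiply-accumulate operations in this layer
    bytes_moved: int  # off-chip traffic in bytes

def bundle_latency_ms(layers, peak_gmacs=4.0, peak_gbps=1.0):
    """Roofline-style latency estimate for one bundle: each layer is
    bounded by either compute throughput or memory bandwidth."""
    total = 0.0
    for l in layers:
        t_compute = l.macs / (peak_gmacs * 1e9)
        t_memory = l.bytes_moved / (peak_gbps * 1e9)
        total += max(t_compute, t_memory)
    return total * 1e3

# Example bundle: DW-Conv3 followed by PW-Conv1 on a 160x90x48 tile.
bundle = [Layer(macs=160*90*48*9,  bytes_moved=160*90*48*2),
          Layer(macs=160*90*48*48, bytes_moved=160*90*48*2*2)]
print(f"estimated bundle latency: {bundle_latency_ms(bundle):.3f} ms")
```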
Step 2: QoR- and QoS-based bundle selection.
To select the most promising bundles, we first evaluate the QoR potential of each bundle by replicating such a bundle n times to construct a prototype DNN. All prototype DNNs are fast-trained (20 epochs) directly on the targeted dataset for accuracy results. Based on the QoS estimation in Step 1, we group prototype DNNs whose QoS is similar to the input targets and select the top-n bundle candidates of each group.

Figure 3. The proposed bi-directional co-design with a bottom-up DNN model exploration and a top-down accelerator design approach. For DNN exploration, we start from the hardware-aware building templates (called Bundles) and grow the DNN to reach the desired QoR; for accelerator design, we follow the proposed architecture using a bundle-reused tile-based pipeline and optimize configurable parameters to pursue the targeted QoS.

Table 1. DNNs for single object detection on 3 × 640 × 360 input images using the different backbones listed (without fully-connected layers) but the same back-end for bounding box regression.

Backbone                        Para. Size (MB)   IoU
ResNet-18 (He et al., 2016)     85                61%
ResNet-32 (He et al., 2016)     162               26%
ResNet-50 (He et al., 2016)     179               32%
VGG-16 (Simonyan et al., 2014)  56                25%

Step 3: Hardware-aware DNN exploration.
By stacking the selected bundle, we start exploring DNNs with a bottom-up approach under the given QoS and QoR constraints by using stochastic coordinate descent (SCD). The DNNs output from SCD are precisely evaluated regarding their QoS and fed back to SCD for DNN model updates; a minimal sketch of this loop appears at the end of this section. The generated DNNs that meet the QoS targets are output for training and fine-tuning to improve their QoR.

We propose a DNN accelerator which provides a tile-based pipelined architecture for efficient implementation of DNN applications with a maximum resource sharing strategy. It includes a folded structure that computes DNN bundles sequentially, reusing the same hardware computing components to save resources when targeting compact IoT devices. To ensure better QoS, it also uses an unfolded structure that computes the operations inside each bundle (partitioned into tiles) in a pipelined manner. With the combination of folded and unfolded structures, the proposed architecture acquires advantages from both the recurrent and the pipelined structure.
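To make the Step 3 exploration loop concrete, here is a minimal sketch of an SCD-style search; the coordinates (stack depth and channel width), the toy latency model, and all constants are our illustrative assumptions rather than the authors' implementation:

```python
import random

def scd_search(qos_target_ms, latency_of, steps=200, seed=0):
    """Stochastic coordinate descent over a DNN config: at each step,
    perturb one randomly chosen coordinate and keep the change if it
    moves the estimated latency closer to the QoS target."""
    rng = random.Random(seed)
    cfg = {"depth": 4, "channels": 48}  # initial DNN built from one bundle
    best = abs(latency_of(cfg) - qos_target_ms)
    for _ in range(steps):
        coord = rng.choice(list(cfg))
        trial = dict(cfg)
        step = 8 if coord == "channels" else 1
        trial[coord] = max(1, trial[coord] + rng.choice([-1, 1]) * step)
        err = abs(latency_of(trial) - qos_target_ms)
        if err < best:  # greedy accept
            cfg, best = trial, err
    return cfg

# Toy latency model, linear in depth x channels (purely illustrative).
latency = lambda c: 0.02 * c["depth"] * c["channels"]
print(scd_search(qos_target_ms=33.3, latency_of=latency))
```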
4. Results and Conclusions
We demonstrate the proposed bi-directional co-design on a real-life object detection task in the DAC'18 System Design Contest, and generate three DNNs (A, B, and C in Table 2) and corresponding accelerators on a Pynq-Z1 FPGA for different QoS-QoR combinations.
Table 2. The proposed DNNs with different data precisions for Weight and Feature map. The convolutional layers include depth-wise (DW) 3 × 3 and point-wise (PW) 1 × 1 kernels; starting from a 3 × 640 × 360 color image, each DNN stacks bundles of DW-Conv3 and PW-Conv1 layers (e.g., DW-Conv3 (3), PW-Conv1 (48)) with 2 × 2 pooling.
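As an illustration of the bundle structure behind Table 2 and the selected building template, the sketch below expresses one DW-Conv3 + PW-Conv1 + max-pooling bundle in PyTorch; the ReLU activations and channel widths are our assumptions, not the published configurations:

```python
import torch
import torch.nn as nn

class Bundle(nn.Module):
    """One DW-Conv3 + PW-Conv1 + max-pooling bundle (illustrative)."""
    def __init__(self, in_ch: int, out_ch: int):
        super().__init__()
        self.dw = nn.Conv2d(in_ch, in_ch, 3, padding=1, groups=in_ch)  # depth-wise 3x3
        self.pw = nn.Conv2d(in_ch, out_ch, 1)                          # point-wise 1x1
        self.pool = nn.MaxPool2d(2)
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.pool(self.act(self.pw(self.act(self.dw(x)))))

# Stack bundles bottom-up on a 3-channel 640x360 input, growing widths.
net = nn.Sequential(Bundle(3, 48), Bundle(48, 96), Bundle(96, 192))
y = net(torch.randn(1, 3, 360, 640))
print(y.shape)  # torch.Size([1, 192, 45, 80])
```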
Table 3. Result comparisons to the champion designs in the FPGA and GPU tracks of the DAC'18 System Design Contest (DAC, 2018).

Model                  IoU     FPS    Efficiency
The proposed DNN-A     59.3%   29.7   12.38 image/watt
The proposed DNN-B     61.2%   22.7   9.46 image/watt
The proposed DNN-C     68.6%   17.4   6.96 image/watt
Modified SSD (FPGA)    62.4%   12.0   2.86 image/watt
Modified Yolo (GPU)    --      --     --

The proposed co-design flow first identifies that the bundle with DW-Conv3, PW-Conv1, and max-pooling layers is the most promising building template for the target hardware device and application. Based on this bundle, the co-design explores three DNN configurations with different quantization schemes to satisfy the respective QoR demands. As shown in Table 3, we can deliver the best FPS (29.7) and efficiency (12.38 image/watt) using the same FPGA as the FPGA champion design. Among them, the proposed DNN-C outperforms the FPGA winning design in all aspects, with 6.2% higher IoU, 1.45X higher FPS, and 2.4X higher efficiency. Compared to the GPU winning design, the DNN-C design delivers comparable accuracy but 3.6X higher efficiency.

Acknowledgment
This work was partly supported by the IBM-Illinois Center for Cognitive Computing Systems Research (C3SR) – a research collaboration as part of the IBM AI Horizons Network.
References
DAC System Design Contest. https://github.com/xyzxinyizhang/2018-DAC-System-Design-Contest, 2018.

Aydonat, U., O'Connell, S., Capalija, D., Ling, A. C., and Chiu, G. R. An OpenCL deep learning accelerator on Arria 10. In Proceedings of the 2017 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, pp. 55-64. ACM, 2017.

Han, S., Kang, J., Mao, H., Hu, Y., Li, X., Li, Y., Xie, D., Luo, H., Yao, S., Wang, Y., et al. ESE: Efficient speech recognition engine with sparse LSTM on FPGA. In Proceedings of the 2017 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, pp. 75-84. ACM, 2017.

He, K., Zhang, X., Ren, S., and Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770-778, 2016.

Jouppi, N. P., Young, C., Patil, N., Patterson, D., Agrawal, G., Bajwa, R., Bates, S., Bhatia, S., Boden, N., Borchers, A., et al. In-datacenter performance analysis of a tensor processing unit. In Proceedings of the 44th Annual International Symposium on Computer Architecture (ISCA), pp. 1-12. IEEE, 2017.

Li, H., Fan, X., Jiao, L., Cao, W., Zhou, X., and Wang, L. A high performance FPGA-based accelerator for large-scale convolutional neural networks. In Proceedings of the 26th International Conference on Field Programmable Logic and Applications (FPL), pp. 1-9. IEEE, 2016.

Li, Y., Yu, M., Li, S., Avestimehr, S., Kim, N. S., and Schwing, A. Pipe-SGD: A decentralized pipelined SGD framework for distributed deep net training. In Proceedings of the 32nd Conference on Neural Information Processing Systems (NIPS'18), Montreal, Canada, December 2018.

Simonyan, K. et al. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.

Wang, J., Lou, Q., Zhang, X., Zhu, C., Lin, Y., and Chen, D. Design flow of accelerating hybrid extremely low bit-width neural network in embedded FPGA. In Proceedings of the 28th International Conference on Field Programmable Logic and Applications (FPL), pp. 163-1636. IEEE, 2018.

Zeng, H., Chen, R., Zhang, C., and Prasanna, V. A framework for generating high throughput CNN implementations on FPGAs. In Proceedings of the 2018 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, pp. 117-126. ACM, 2018.

Zhang, X., Wang, J., Zhu, C., Lin, Y., Xiong, J., Hwu, W.-m., and Chen, D. DNNBuilder: an automated tool for building high-performance DNN hardware accelerators for FPGAs. In Proceedings of the International Conference on Computer-Aided Design (ICCAD), 2018.