Transparent FPGA Acceleration with TensorFlow
Simon Pfenning, Philipp Holzinger and Marc Reichenbach
Department of Computer Science, Chair of Computer Architecture
Friedrich-Alexander-University Erlangen-Nuremberg, Germany
{simon.pfenning, philipp.holzinger, marc.reichenbach}@fau.de

Abstract—Today, artificial neural networks are one of the major innovators pushing the progress of machine learning. This has particularly affected the development of neural network accelerating hardware. However, since most of these architectures require specialized toolchains, there is a certain amount of additional effort for developers each time they want to make use of a new deep learning accelerator. Furthermore, the flexibility of the device is bound to the architecture itself, as well as to the functionality of the runtime environment.
In this paper we propose a toolflow using TensorFlow as frontend, thus offering developers the opportunity of using a familiar environment. On the backend we use an FPGA, which is addressable via an HSA runtime environment. In this way we are able to hide the complexity of controlling new hardware from the user, while at the same time maintaining a high amount of flexibility. This can be achieved by our HSA toolflow, since the hardware is not statically configured with the structure of the network. Instead, it can be dynamically reconfigured during runtime with the respective kernels executed by the network and simultaneously from other sources, e.g. OpenCL/OpenMP.
Index Terms—TensorFlow, Deep Learning, FPGA
I. INTRODUCTION
Modern AI algorithms based on the concept of artificial neural networks (ANN) have significantly expanded the capabilities of machine learning (ML). Recently, there have been increasing efforts to make those achievements accessible to mobile devices as well. However, the power consumption in this environment is severely restricted, which makes highly energy-efficient designs inevitable. For this reason, throughput-optimized GPGPUs, the past driving force of ML, will not be able to similarly dominate there in the long term. Instead, application-specific architectures already play an increasingly important role. Although current developments like Google's TPU are dedicated ASIC solutions, FPGAs are also gaining in significance. Their reconfigurability gives them the flexibility to adapt to a wide range of tasks while retaining a high efficiency. With this capability they can not only accelerate the neural networks themselves but also pre- and post-processing as well as sensor fusion, which can differ greatly between applications.
Even though previous FPGA approaches have already shown advantages over traditional general-purpose architectures, they are not able to fully exploit them in a complete heterogeneous system. They usually claim the hardware exclusively for the ANN or require their own frameworks and development flows. This leads to the situation where application developers who want to use the new hardware capabilities often need to diverge from their already established workflow.
[Figure: a user program consisting of preprocessing, DNN inference and postprocessing stages is mapped through the OpenCL/OpenMP runtimes and the HSA runtime onto CPU, GPU and FPGA hardware; our approach covers the HSA runtime path to the FPGA.]
Fig. 1: Mapping of DL applications with our toolchain.

Furthermore, the hardware might not be utilized to its fullest, since the functionality is limited to the processing of certain network types.
Hence, in this paper we introduce a new method for making use of hardware accelerators in the field of deep learning (DL) without requiring a considerable amount of additional effort and yet with a high degree of flexibility. Since TensorFlow (TF) is one of the most commonly used DL frameworks, we decided to use it as the frontend of our toolflow. In order to achieve as much flexibility with the system as possible, the underlying methodology is based on the HSA Foundation standard [1]. This standard delivers a common way of controlling conforming devices like GPUs, CPUs or DSPs.

II. RELATED WORK
There are similar approaches which have tried to synthesize static netlists out of TF for FPGAs. LeFlow [2] uses the TF-internal compiler to obtain a static compute graph of the network, which is then transferred via high-level synthesis into an FPGA-specific netlist. Xilinx's Vitis AI [3] framework takes a similar route by analyzing and optimizing the DL model with their own AI Compiler and handing the result to the Vitis AI Runtime, which deploys the work to FPGA accelerators. Contrary to these static procedures, our method, based on the principles described in [4], [5], does not statically map the model onto the FPGA as a whole. Instead, it utilizes the TF runtime for issuing the workload operation by operation. This allows for a more flexible use of the FPGA, not solely for the execution of a specific network model.

III. CONCEPT
Figure 1 shows the mapping of applications to dedicated hardware with our proposed approach. In contrast to other solutions, our concept refrains from using a secondary toolchain
which processes the frozen graph for the FPGA. Instead, everything needed is completely integrated into TF itself and can be utilized through the same Python/C++ calls developers are already familiar with.
However, applications rarely contain only procedures provided by a DL framework. Instead, they are usually divided into the network inference itself and several external pre- and post-processing steps, e.g. for data acquisition and sensor fusion. Therefore, it is the responsibility of a comprehensive toolflow to foster these common use cases. Our proposed concept combines these two aspects by abstracting the low-level details with a common standard [1] implemented in the HSA runtime and its associated drivers. It manages all HSA devices in the system, informs its users about the status of the underlying hardware and synchronizes dispatched tasks. The necessary HSA runtime calls can be generated either by a standard OpenCL/OpenMP compiler or by the TF framework. For this purpose, the TF runtime has been extended by a respective device backend. It detects and manages all the accessible HSA devices visible to the framework. By using an annotation in their Python or C++ code, developers can request that operations are executed on certain device types (see the sketch below). If TF finds a registered kernel implementation for HSA devices, the operation is dispatched using HSA runtime calls, which are made available to TF by our extension.
In principle this is no different from the method for any other accelerator. The major difference is that, in the case of an FPGA, there are two options of what a registered kernel can be. The simplest and most flexible solution would be an OpenCL implementation. During inference this would result in compilation of an intermediate format shared by all HSA devices. After a runtime synthesis, the device-specific bitstream is generated and deployed to the FPGA. The great advantage of this method lies in its flexibility, since not only can the synthesis target be changed during runtime, but the same OpenCL kernel can also be used for various types of HSA accelerators. On the downside, this approach leads to a significant increase in runtime and energy costs, especially due to the online synthesis. As we focus on a mobile use case, we found it a better solution to register presynthesized bitstreams as kernels for TF, which are then deployed during runtime and used for partial reconfiguration of the FPGA. This way we still maintain a high degree of flexibility without having to suffer from the disadvantage of highly increased energy consumption. Since the FPGA is not configured once with a static network structure, but is dynamically reconfigured for each kernel call, it is not monopolized by the network and can be used for other tasks like pre- and post-processing steps.
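To illustrate the intended developer experience, the following minimal sketch shows how such an annotation could look from the TF Python API. The device string "/device:HSA:0" is an assumed identifier for illustration, not the verbatim name registered by our extension, and the sketch presumes that the described HSA device backend is loaded.

```python
import tensorflow as tf

# Hypothetical sketch: device placement with an HSA-backed FPGA.
# "/device:HSA:0" is an assumed device name that a TF device backend
# like the one described here could register; it is not an official
# TensorFlow device type.

x = tf.random.uniform([1, 64])    # input produced by preprocessing
w = tf.random.uniform([64, 10])   # layer weights

with tf.device("/device:HSA:0"):
    # If a kernel for this op is registered for the HSA device
    # (e.g. a presynthesized bitstream), TF dispatches it through
    # the HSA runtime; otherwise placement falls back to the CPU,
    # depending on the soft device placement settings.
    y = tf.linalg.matmul(x, w)

# Post-processing continues transparently on the host.
print(tf.nn.softmax(y))
```

Apart from the device annotation, the program is ordinary TF code, which is the point of the approach: the dispatch path through the HSA runtime stays hidden from the developer.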
IV. EVALUATION
We ran a preliminary implementation of our concept on an Ultra96 board. Since not all parts are conclusively implemented, the measurements are not to be considered complete, but they give a first impression. Table I depicts the FPGA resource utilization of the shell and several layer variants:
1) Fully connected (float32)
2) Fully connected with barrier (float32)
3) Conv 5×5, 1 filter, fixed weights (int16)
4) Conv 3×3, 2 filters, fixed weights (int16)

TABLE I: Utilization of the Programmable Logic

Kernel   LUTs           FFs           BRAM         DSPs
Shell    9915 (14.1%)   8544 (6.1%)   10 (4.6%)    0 (0.0%)
Role 1   9984 (14.1%)   8479 (6.0%)   21 (9.7%)    22 (6.1%)
Role 2   9501 (13.5%)   7851 (5.6%)   23 (10.6%)   8 (2.2%)
Role 3   5091 (7.2%)    4935 (3.5%)   21 (9.7%)    6 (1.7%)
Role 4   7881 (11.2%)   7926 (5.6%)   21 (9.7%)    12 (3.3%)
The overhead caused by our TensorFlow-HSA approach is listed in Table II. At the beginning, all device management mechanisms are set up for kernel dispatches. This delay therefore occurs only once during execution. Reconfiguration is automatically handled by the runtime and happens every time a kernel that is not currently loaded on the FPGA is executed. In this process, an LRU eviction scheme is used if more roles than available regions need to be handled; a minimal model of this scheme is sketched below. TF can consider this trade-off to either generate a lower number of generic roles or fix layer weights to obtain more efficient hardware. Finally, the kernels can be dispatched with a low latency as often as needed.

TABLE II: Overhead of FPGA TensorFlow [µs] (n=1000)

Operation            Occurrence          TensorFlow   HSA Runtime
device/kernel setup  once                156230       39032
reconfiguration      if not configured   0            7424
dispatch latency     every dispatch      27           10

Our first measurements, shown in Table III, already demonstrate an improvement of up to 6.51× over a plain ARM Cortex-A53 implementation. In the future we will further optimize the components to be comparable to state-of-the-art approaches.

TABLE III: Efficiency benefit compared to CPU (n=1000)

                    Role 1   Role 2   Role 3   Role 4
OP/cycle increase   6.51×    ×        ×        ×
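To make the reconfiguration behavior concrete, the following minimal Python model sketches an LRU-managed set of partially reconfigurable regions and accumulates the overhead using the HSA runtime numbers from Table II. The class, its methods and the example workload are our own illustration, not the actual runtime implementation.

```python
from collections import OrderedDict

# Measured HSA runtime overheads from Table II (microseconds).
SETUP_US = 39032      # device/kernel setup, paid once
RECONF_US = 7424      # partial reconfiguration, paid on a region miss
DISPATCH_US = 10      # latency paid on every kernel dispatch

class RegionManager:
    """Illustrative model of LRU-managed reconfigurable regions."""

    def __init__(self, num_regions):
        self.num_regions = num_regions
        self.loaded = OrderedDict()  # role -> region, kept in LRU order

    def dispatch(self, role):
        """Return the overhead in microseconds for dispatching one role."""
        cost = DISPATCH_US
        if role in self.loaded:
            self.loaded.move_to_end(role)        # refresh LRU position
        else:
            if len(self.loaded) >= self.num_regions:
                self.loaded.popitem(last=False)  # evict least recently used
            self.loaded[role] = None
            cost += RECONF_US                    # bitstream must be loaded
        return cost

# Example: four roles cycling over a single region, so every
# dispatch misses and triggers a partial reconfiguration.
mgr = RegionManager(num_regions=1)
total = SETUP_US
for _ in range(1000):
    for role in ("fc", "fc_barrier", "conv5x5", "conv3x3"):
        total += mgr.dispatch(role)
print(f"total overhead: {total / 1e6:.2f} s")
```

Increasing num_regions in this model shows how the reconfiguration cost amortizes: once all roles fit on the fabric simultaneously, only the initial setup and the small per-dispatch latency remain.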
ACKNOWLEDGMENT

This work is a result of the project "KI-Flex" (project number 16ES1027), funded by the German Federal Ministry of Education and Research (BMBF) within the funding program "Microelectronics from Germany – Innovation Driver".

REFERENCES
[1] HSA Foundation, HSA standard specifications.
[2] D. H. Noronha, B. Salehpour, and S. J. E. Wilton, "LeFlow: Enabling Flexible FPGA High-Level Synthesis of Tensorflow Deep Neural Networks," 2018.
[3] Xilinx, Vitis AI User Documentation.
[4] M. Reichenbach, P. Holzinger, K. Häublein, T. Lieske, P. Blinzer, and D. Fey, "Heterogeneous Computing Utilizing FPGAs," J. Sign. Process. Syst., vol. 91, pp. 745–757, May 2018.
[5] P. Holzinger, M. Reichenbach, and D. Fey, "A New Generic HLS Approach for Heterogeneous Computing: On the Feasibility of High-level Synthesis in HSA-compatible Systems," in Proc. 18th Int. Conf. Embed. Comput. Syst.: Arch., Model. and Simul., 2018, pp. 18–27.