Transparent FPGA Acceleration with TensorFlow
Simon Pfenning, Philipp Holzinger and Marc Reichenbach
Department of Computer Science, Chair of Computer Architecture
Friedrich-Alexander-University Erlangen-Nuremberg, Germany
{simon.pfenning, philipp.holzinger, marc.reichenbach}@fau.de

Abstract—Today, artificial neural networks are one of the major innovators pushing the progress of machine learning. This has particularly affected the development of neural network accelerating hardware. However, since most of these architectures require specialized toolchains, there is a certain amount of additional effort for developers each time they want to make use of a new deep learning accelerator. Furthermore, the flexibility of the device is bound to the architecture itself, as well as to the functionality of the runtime environment.
In this paper we propose a toolflow using TensorFlow as frontend, thus offering developers the opportunity of using a familiar environment. On the backend we use an FPGA, which is addressable via an HSA runtime environment. In this way we are able to hide the complexity of controlling new hardware from the user, while at the same time maintaining a high amount of flexibility. This can be achieved by our HSA toolflow, since the hardware is not statically configured with the structure of the network. Instead, it can be dynamically reconfigured during runtime with the respective kernels executed by the network and simultaneously from other sources, e.g. OpenCL/OpenMP.
Index Terms—TensorFlow, Deep Learning, FPGA
I. INTRODUCTION
Modern AI algorithms based on the concept of artificial neural networks (ANN) have significantly expanded the capabilities of machine learning (ML). Recently, there have been increasing efforts to make those achievements accessible to mobile devices as well. However, the power consumption in this environment is severely restricted, which makes highly energy-efficient designs inevitable. For this reason, throughput-optimized GPGPUs, the past driving force of ML, will not be able to similarly dominate there in the long term. Instead, application-specific architectures already play an increasingly important role. Although current developments like Google's TPU are dedicated ASIC solutions, FPGAs are also gaining in significance. Their reconfigurability gives them the flexibility to adapt to a wide range of tasks while retaining a high efficiency. With this capability they can not only accelerate the neural networks themselves but also pre- and post-processing as well as sensor fusion, which can differ greatly between applications.
Even though previous FPGA approaches have already shown advantages over traditional general-purpose architectures, they are not able to fully exploit them in a complete heterogeneous system. They usually claim the hardware exclusively for the ANN or require their own frameworks and development flows. This leads to the situation where application developers who want to use the new hardware capabilities often need to diverge from their already established workflow.
[Figure: a user program consisting of preprocessing, DNN inference and postprocessing stages is mapped through the OpenCL/OpenMP runtimes and the HSA runtime onto CPU, GPU and FPGA hardware; our approach covers the HSA runtime path to the FPGA.]
Fig. 1: Mapping of DL applications with our toolchain.

Furthermore, the hardware might not be utilized to its fullest, since the functionality is limited to the processing of certain network types.
Hence, in this paper we introduce a new method for making use of hardware accelerators in the field of deep learning (DL) without requiring a considerable amount of additional effort and yet with a high degree of flexibility. Since TensorFlow (TF) is one of the most commonly used DL frameworks, we decided to use it as the frontend of our toolflow. In order to achieve as much flexibility with the system as possible, the underlying methodology is based on the HSA Foundation standard [1]. This standard delivers a common way of controlling conforming devices like GPUs, CPUs or DSPs.

II. RELATED WORK
There are similar approaches which have tried to synthesize static netlists out of TF for FPGAs. LeFlow [2] uses the TF-internal compiler to obtain a static compute graph of the network, which is then transferred via high-level synthesis into an FPGA-specific netlist. Xilinx's Vitis AI [3] framework takes a similar route by analyzing and optimizing the DL model with their own AI Compiler and handing the result to the Vitis AI Runtime, which deploys the work to FPGA accelerators. Contrary to these static procedures, our method, based on the principles described in [4], [5], does not statically map the model onto the FPGA as a whole. Instead, it utilizes the TF runtime for issuing the workload operation by operation. This allows for a more flexible use of the FPGA, not solely for the execution of a specific network model.

III. CONCEPT
Figure 1 shows the mapping of applications to dedicated hardware with our proposed approach. In contrast to other solutions, our concept refrains from using a secondary toolchain
which processes the frozen graph for the FPGA. Instead, everything needed is completely integrated into TF itself and can be utilized through the same Python/C++ calls developers are already familiar with.
However, applications rarely contain only procedures provided by a DL framework. Instead, they are usually divided into the network inference itself and several external pre- and post-processing steps, e.g. for data acquisition and sensor fusion. Therefore, it is the responsibility of a comprehensive toolflow to foster these common use cases. Our proposed concept combines these two aspects by abstracting the low-level details with a common standard [1] implemented in the HSA runtime and its associated drivers. It manages all HSA devices in the system, informs its users about the status of the underlying hardware and synchronizes dispatched tasks. The necessary HSA runtime calls can be generated either by a standard OpenCL/OpenMP compiler or by the TF framework. For this purpose, the TF runtime has been extended by a respective device backend. It detects and manages all the accessible HSA devices visible to the framework. By using an annotation in their Python or C++ code, developers can request that operations are executed on certain device types (see the sketch below). If TF finds a registered kernel implementation for HSA devices, the operation is dispatched using HSA runtime calls, which are made available to TF by our extension.
In principle this is no different from the method for any other accelerator. The major difference is that, in the case of an FPGA, there are two options of what a registered kernel can be. The simplest and most flexible solution would be an OpenCL implementation. During inference this would result in compilation of an intermediate format shared by all HSA devices. After a runtime synthesis, the device-specific bitstream is generated and deployed to the FPGA. The great advantage of this method lies in its flexibility, since not only can the synthesis target be changed during runtime, but the same OpenCL kernel can also be used for various types of HSA accelerators. On the downside, this approach leads to a significant increase in runtime and energy costs, especially due to the online synthesis. As we focus on a mobile use case, we found it a better solution to register presynthesized bitstreams as kernels for TF, which are then deployed during runtime and used for partial reconfiguration of the FPGA. This way we still maintain a high degree of flexibility without having to suffer from the disadvantage of highly increased energy consumption. Since the FPGA is not configured once with a static network structure, but is dynamically reconfigured for each kernel call, it is not monopolized by the network and can be used for other tasks like pre- and post-processing steps.
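To illustrate the intended developer experience, the following minimal sketch shows how such an annotation could look from the TF Python API. The device string "/device:HSA:0" is an assumed identifier for illustration, not the verbatim name registered by our extension, and the sketch presumes that the described HSA device backend is loaded.

```python
import tensorflow as tf

# Hypothetical sketch: device placement with an HSA-backed FPGA.
# "/device:HSA:0" is an assumed device name that a TF device backend
# like the one described here could register; it is not an official
# TensorFlow device type.

x = tf.random.uniform([1, 64])    # input produced by preprocessing
w = tf.random.uniform([64, 10])   # layer weights

with tf.device("/device:HSA:0"):
    # If a kernel for this op is registered for the HSA device
    # (e.g. a presynthesized bitstream), TF dispatches it through
    # the HSA runtime; otherwise placement falls back to the CPU,
    # depending on the soft device placement settings.
    y = tf.linalg.matmul(x, w)

# Post-processing continues transparently on the host.
print(tf.nn.softmax(y))
```

Apart from the device annotation, the program is ordinary TF code, which is the point of the approach: the dispatch path through the HSA runtime stays hidden from the developer.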
IV. EVALUATION
We ran a preliminary implementation of our concept on an Ultra96 board. Since not all parts are conclusively implemented, the measurements are not to be considered complete, but they give a first impression. Table I depicts the FPGA resource utilization of the shell and several layer variants:
1) Fully connected (float32)
2) Fully connected with barrier (float32)
3) Conv 5×5, 1 filter, fixed weights (int16)
4) Conv 3×3, 2 filters, fixed weights (int16)

TABLE I: Utilization of the Programmable Logic

Kernel   LUTs           FFs           BRAM         DSPs
Shell    9915 (14.1%)   8544 (6.1%)   10 (4.6%)    0 (0.0%)
Role 1   9984 (14.1%)   8479 (6.0%)   21 (9.7%)    22 (6.1%)
Role 2   9501 (13.5%)   7851 (5.6%)   23 (10.6%)   8 (2.2%)
Role 3   5091 (7.2%)    4935 (3.5%)   21 (9.7%)    6 (1.7%)
Role 4   7881 (11.2%)   7926 (5.6%)   21 (9.7%)    12 (3.3%)
The overhead caused by our TensorFlow-HSA approach is listed in Table II. At the beginning, all device management mechanisms are set up for kernel dispatches. This delay therefore occurs only once during execution. Reconfiguration is automatically handled by the runtime and happens every time a kernel that is not currently loaded on the FPGA is executed. In this process, an LRU eviction scheme is used if more roles than available regions need to be handled; a minimal model of this scheme is sketched below. TF can consider this trade-off to either generate a lower number of generic roles or fix layer weights to obtain more efficient hardware. Finally, the kernels can be dispatched with a low latency as often as needed.

TABLE II: Overhead of FPGA TensorFlow [µs] (n=1000)

Operation            Occurrence          TensorFlow   HSA Runtime
device/kernel setup  once                156230       39032
reconfiguration      if not configured   0            7424
dispatch latency     every dispatch      27           10

Our first measurements, shown in Table III, already demonstrate an improvement of up to 6.51× over a plain ARM Cortex-A53 implementation. In the future we will further optimize the components to be comparable to state-of-the-art approaches.

TABLE III: Efficiency benefit compared to CPU (n=1000)

                    Role 1   Role 2   Role 3   Role 4
OP/cycle increase   6.51×    ×        ×        ×
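To make the reconfiguration behavior concrete, the following minimal Python model sketches an LRU-managed set of partially reconfigurable regions and accumulates the overhead using the HSA runtime numbers from Table II. The class, its methods and the example workload are our own illustration, not the actual runtime implementation.

```python
from collections import OrderedDict

# Measured HSA runtime overheads from Table II (microseconds).
SETUP_US = 39032      # device/kernel setup, paid once
RECONF_US = 7424      # partial reconfiguration, paid on a region miss
DISPATCH_US = 10      # latency paid on every kernel dispatch

class RegionManager:
    """Illustrative model of LRU-managed reconfigurable regions."""

    def __init__(self, num_regions):
        self.num_regions = num_regions
        self.loaded = OrderedDict()  # role -> region, kept in LRU order

    def dispatch(self, role):
        """Return the overhead in microseconds for dispatching one role."""
        cost = DISPATCH_US
        if role in self.loaded:
            self.loaded.move_to_end(role)        # refresh LRU position
        else:
            if len(self.loaded) >= self.num_regions:
                self.loaded.popitem(last=False)  # evict least recently used
            self.loaded[role] = None
            cost += RECONF_US                    # bitstream must be loaded
        return cost

# Example: four roles cycling over a single region, so every
# dispatch misses and triggers a partial reconfiguration.
mgr = RegionManager(num_regions=1)
total = SETUP_US
for _ in range(1000):
    for role in ("fc", "fc_barrier", "conv5x5", "conv3x3"):
        total += mgr.dispatch(role)
print(f"total overhead: {total / 1e6:.2f} s")
```

Increasing num_regions in this model shows how the reconfiguration cost amortizes: once all roles fit on the fabric simultaneously, only the initial setup and the small per-dispatch latency remain.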
ACKNOWLEDGMENT

This work is a result of the project "KI-Flex" (project number 16ES1027), funded by the German Federal Ministry of Education and Research (BMBF) within the funding program "Microelectronics from Germany – Innovation Driver".

REFERENCES
[1] HSA Foundation, HSA standard specifications.
[2] D. H. Noronha, B. Salehpour, and S. J. E. Wilton, "LeFlow: Enabling Flexible FPGA High-Level Synthesis of Tensorflow Deep Neural Networks," 2018.
[3] Xilinx, Vitis AI User Documentation.
[4] M. Reichenbach, P. Holzinger, K. Häublein, T. Lieske, P. Blinzer, and D. Fey, "Heterogeneous Computing Utilizing FPGAs," J. Sign. Process. Syst., vol. 91, pp. 745–757, May 2018.
[5] P. Holzinger, M. Reichenbach, and D. Fey, "A New Generic HLS Approach for Heterogeneous Computing: On the Feasibility of High-level Synthesis in HSA-compatible Systems," in Proc. 18th Int. Conf. Embed. Comput. Syst.: Arch., Model. and Simul., 2018, pp. 18–27.