Building Application-Specific Overlays on FPGAs with High-Level Customizable IPs
Hongbo Rong
Parallel Computing Lab (PCL), Intel
[email protected]
Abstract
Overlays are virtual, re-configurable architectures that overlay on top of physical FPGA fabrics [8]. An overlay that is specialized for an application, or a class of applications, offers both fast reconfiguration and a minimized performance penalty. Such an overlay is usually implemented by hardware designers in hardware "assembly" languages at register-transfer level (RTL).

This short article proposes an idea for a software programmer, instead of hardware designers, to quickly implement an application-specific overlay using high-level customizable IPs. These IPs are expressed succinctly in a specification language whose abstraction level is much higher than RTL but which can nonetheless express many performance-critical loop and data optimizations on FPGAs; the IPs thus offer competitively high performance at a much lower maintenance cost and with much easier customization.

We propose new language features to easily put the IPs together into an overlay. A compiler automatically implements the specified optimizations to generate an efficient overlay, exposes a multi-tasking programming interface for the overlay, and inserts a runtime scheduler for scheduling tasks to run on the IPs of the overlay, respecting the dependences between the tasks. While an application written in any language can take advantage of the overlay through the programming interface, we show a particular usage scenario, where the application itself is also succinctly specified in the same language.

We describe the new language features for expressing overlays, and illustrate the features with an LU decomposer and a convolutional neural network. A system is under construction to implement the language features and workloads.
1. Introduction
An FPGA has a massive number of logic elements that are distributed, locally connected and run in parallel, interleaved with memory blocks and often with hardened DSP blocks. The logic elements, interconnects, memory and DSP blocks can be synthesized to match a dataflow compute for the best performance and power efficiency. However, the synthesis time tends to be very long: even a small design may take tens of minutes, and a larger design can easily take hours or even days.

Overlays have been proposed to cut down the synthesis time. Overlays are virtual, re-configurable architectures that overlay on top of physical FPGA fabrics [8]. An overlay usually has (much) coarser granularity, and thus a (much) smaller amount, of resources that can be re-configured. Therefore, the resources of an overlay can be synthesized for a dataflow compute at a radically faster speed than with traditional hardware synthesis [8, 3, 2].

An overlay offers software programmers a software-like programming experience: an overlay is built with hardware IPs on top of an FPGA; the hardware IPs have a higher abstraction level (e.g. matrix or vector level), and thus programmers can program the overlay at that higher abstraction level instead, reaching higher productivity at a reasonable performance cost. However, there are remaining problems:

• An overlay itself is usually still implemented at RTL, and by hardware experts, with a high development cost.

• Overlays are often available only for hot domains or applications (e.g. deep learning these days [4, 6, 1]). Existing overlays might not necessarily match new algorithms, applications or domains well.

This short article proposes an idea to enable a software programmer, instead of hardware experts, to quickly build an application-specific overlay on an FPGA using high-level customizable IPs.
These IPs are succinctly specified: the dataflow of an IP is expressed in a functional notation, followed by a description of how to efficiently map the dataflow onto the spatial FPGA architecture with many loop and data optimizations, e.g. how to map the dataflow onto a systolic array that matches the underlying FPGA architecture well and thus is critical for performance.

The IPs are only specified, while the detailed implementation of the specified optimizations is left to a compiler. The specification language and compiler used are T2S (Temporal To Spatial) [7]. Our previous work on T2S has proved that a smart compiler can generate efficient IPs in a fraction of the development time but with competitive performance, compared with the same IPs optimized with the same set of optimizations, where the optimizations are implemented manually by experts in high-level synthesis (HLS) languages [9, 5].

Since the IPs are written at an abstraction level much higher than RTL, the IPs require a much lower maintenance cost, are much easier to customize by software programmers, and, on the other hand, with the right set of optimizations, can offer competitively high performance.

The compiler will automatically expose a multi-tasking programming interface for an overlay, and insert a runtime scheduler for scheduling tasks to run on the overlay, respecting the dependences between the tasks. While an application written in any language can take advantage of the overlays through the programming interfaces, we show a particular usage scenario, where the application itself is also succinctly specified in the same language.

This approach is generally applicable to many applications that have many tasks, where the tasks need to share limited FPGA resources. We will illustrate the approach with a VGG convolutional neural network and a blocked LU decomposer. We will define an overlay for each of them; each overlay contains a few IPs on an FPGA.
For the neural network, we map and schedule all the layers to an overlay. For the blocked LU decomposer, we dynamically generate tasks, and schedule them to the other overlay.

This article focuses on describing the idea. We are building a prototype to implement the proposed idea, leveraging our current systems [9, 5]. We will report the progress in future publications.
2. Overall Flow
Fig. 1 shows the overall flow. A programmer specifies a definition of an overlay. Directed by the specification, a compiler automatically links the overlay definition with a pre-written runtime system, synthesizes them into a bitstream for an FPGA, and generates a programming interface for the overlay. The runtime system includes command queues, a task graph and a scheduler.

The overlay generated on the FPGA can be invoked by an application written in any language by calling the programming interface. A particularly interesting scenario is that the application is also written in the same specification language. In Fig. 1, we show that a programmer specifies an application to run on the overlay. The compiler synthesizes the application with the programming interface into another bitstream.

Then the compiler offloads both the overlay and the application to an FPGA. When the programmer invokes the application to run, the application automatically generates tasks for the runtime system to schedule to run on the overlay.

Footnote: We believe that if the compiler is engineered right, the performance of an IP will be mainly determined by the set of optimizations used for the IP, not by whether the optimizations are implemented automatically or manually. This belief has been supported by our current prototypes. Our current prototypes generate HLS code only, and thus we compare only with expert-written HLS code. However, there is no restriction preventing our approach from generating RTL code, which is purely an engineering effort. When generating RTL code, we believe the same phenomenon will repeat: IPs specified in our language and implemented in detail by the automatic compiler should exhibit competitive performance vs. expert-written RTL code with the same set of optimizations. We will verify this belief in the future.
The example application shown in the figure is an LU decomposer, which has many tasks of 4 kinds generated during the execution, dispatched by the runtime to run on the 4 corresponding hardware IPs in the overlay. We will describe this example in more detail below.
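The runtime pieces named above, one command queue per IP, a task graph, and a dynamic scheduler, can be sketched in a few lines. The following Python sketch is illustrative only (the names `Task` and `schedule` are ours, not part of T2S): it simulates the four kinds of LU tasks being dispatched to their IPs while respecting dependences.

```python
# An illustrative model (our own names, not the actual T2S runtime) of the
# runtime: one command queue per IP, a task graph encoded as per-task
# dependence sets, and a scheduler that lets an IP start the task at the head
# of its queue only once all of that task's dependences have completed.
from collections import deque

class Task:
    def __init__(self, name, ip, deps=()):
        self.name, self.ip, self.deps = name, ip, set(deps)

def schedule(tasks):
    """Simulate dispatching tasks to per-IP command queues, respecting
    dependences. Returns the order in which tasks complete."""
    queues = {}
    for t in tasks:
        queues.setdefault(t.ip, deque()).append(t)  # in-queue order preserved
    done, order = set(), []
    while any(queues.values()):
        progressed = False
        for q in queues.values():
            if q and q[0].deps <= done:  # head task's dependences satisfied
                t = q.popleft()
                done.add(t.name)
                order.append(t.name)
                progressed = True
        if not progressed:
            raise RuntimeError("cyclic or unsatisfiable dependences")
    return order

# One iteration of the blocked LU example: the row/column-panel transforms
# depend on the diagonal decomposition, and the GEMM update depends on both.
order = schedule([Task("T0", "LU"),
                  Task("T1", "TransformRowPanel", deps={"T0"}),
                  Task("T2", "TransformColumnPanel", deps={"T0"}),
                  Task("T3", "GEMM", deps={"T1", "T2"})])
```

Running this yields a completion order in which T0 precedes T1 and T2, and both precede T3; a real runtime would run the IPs concurrently, but the dependence test at the queue heads is the same.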
3. Examples
In this section, we illustrate our idea with an LU decomposer and a VGG convolutional neural network. Instead of using formal definitions, we will intuitively and effectively explain our language features through these examples.
For a matrix

A = \begin{pmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{pmatrix},

we would like to decompose it into

A = LU = \begin{pmatrix} L_{11} & \\ L_{21} & L_{22} \end{pmatrix} \begin{pmatrix} U_{11} & U_{12} \\ & U_{22} \end{pmatrix}.

Therefore, it is easy to see that

A_{11} = L_{11} U_{11}   (1)
A_{12} = L_{11} U_{12}   (2)
A_{21} = L_{21} U_{11}   (3)
A_{22} = L_{21} U_{12} + L_{22} U_{22}   (4)

Therefore,

A_{11} = L_{11} U_{11}   (5)
U_{12} = L_{11}^{-1} A_{12}   (6)
L_{21} = A_{21} U_{11}^{-1}   (7)
L_{22} U_{22} = A_{22} - L_{21} U_{12}   (8)

We can generalize this example. Suppose the original square matrix A is divided into n * n square blocks, each block having m * m elements. The algorithm of blocked LU is shown in Algorithm 1. We envision that a T2S specification can be written as shown in Fig. 1. There are 4 hardware IPs:

Figure 1: The overall flow

Algorithm 1:
The blocked LU decomposition algorithm.

for (i = 0; i < n; i++)
    Task 0: decompose A_{ii} = L_{ii} U_{ii}
    Task 1: calculate U_{i,(i+1):n} = L_{ii}^{-1} A_{i,(i+1):n}
    Task 2: calculate L_{(i+1):n,i} = A_{(i+1):n,i} U_{ii}^{-1}
    Task 3: calculate A_{(i+1):n,(i+1):n} -= L_{(i+1):n,i} U_{i,(i+1):n}

• LU, which accepts a square block A with the size of m * m, decomposes it into matrices L and U, and stores them in the same space as A. Note that the diagonal of L contains only 1's, and thus is not stored.

• TransformRowPanel, which accepts a row panel with a number of blocks, each block with the size of m * m, and uses the first block (corresponding to L_{ii}) to transform the other blocks, i.e. L_{ii}^{-1} A_{i,(i+1):n}.

• TransformColumnPanel, which accepts a column panel with a number of blocks, each block with the size of m * m, and uses the first block (corresponding to U_{ii}) to transform the other blocks, i.e. A_{(i+1):n,i} U_{ii}^{-1}.

• GEMM, which accepts a matrix
C and matrices A, B, along with coefficients α, β, γ, and computes C = αC + βA * γB.

All the 4 IPs do in-place updates: they write their outputs into the same space as their inputs. In Fig. 1, the specifications use several features new to the T2S language:

• The Overlay type is a container for the IPs and the runtime system.

• F.command(queueNo, parameters) specifies a programming interface for Func F: the command queue and the parameters.

• O.enqueue(queueNo, parameters) enqueues a command, with the given parameters, to the given command queue of the overlay O.

• T1.depend(T2, d, [condition]) says that, under an optional condition, task T1 in the current iteration depends on task T2 from d iterations before.

• A.BCropped(m, startRow, endRow, startCol, endCol) means to crop, in blocks of m * m, from a buffer A, from the given start row to end row (included), and from the given start column to end column (included). The cropping is in-place, and thus the cropped buffer shares the space with the original buffer.

We can explain Fig. 1 in more detail. A software programmer writes two specifications, one for the overlay, and the other for the application (i.e. the LU decomposer).

In the specification of the overlay, Lines 1-3 declare the 4 IPs on an (FPGA) device. Lines 4-7 declare the inputs of the IPs. Line 8 defines the IPs. We assume that the IPs have already been specified with the necessary optimizations in the T2S language by experts, and are provided to the programmer as a library of building blocks. Therefore, we skip the details of the definitions of the IPs here. Lines 9-12 define a programming interface for each IP. Each is driven by a command queue, which is automatically provided by a runtime system. Finally, Lines 13-14 put the IPs into an overlay, and compile the overlay to a named bitstream.

In the specification of the application, Lines 1-2 declare the matrix to be decomposed, and the 4 kinds of tasks corresponding to the 4 IPs. Lines 3-9 define some macros that are only for convenience of use later. Line 10 offloads the overlay's bitstream to an FPGA, if not done yet, and returns a handle. Lines 11-14 generate the 4 tasks and enqueue them into the command queues of the corresponding IPs. Note that there is an implicit loop i around the tasks.
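To make the four IPs and their in-place semantics concrete, here is a small pure-Python sketch (our own illustrative code, not a T2S specification; all function names are ours) that models Algorithm 1 on a 4x4 matrix split into 2x2 blocks, and then verifies that the packed L and U factors reproduce the original matrix.

```python
# A pure-Python model of Algorithm 1 (blocked LU without pivoting). Each of
# the four IPs becomes a function with the in-place semantics described in
# the text: L's unit diagonal is not stored, and L/U overwrite A's space.

def lu_factor(B):
    """IP 'LU': in-place Doolittle factorization of an m x m block. After the
    call, B holds U on/above the diagonal and L strictly below it."""
    m = len(B)
    for k in range(m):
        for i in range(k + 1, m):
            B[i][k] /= B[k][k]
            for j in range(k + 1, m):
                B[i][j] -= B[i][k] * B[k][j]

def transform_row_block(Lii, A):
    """IP 'TransformRowPanel' on one block: A <- Lii^-1 A (forward subst.)."""
    for j in range(len(A[0])):
        for i in range(len(A)):
            for k in range(i):
                A[i][j] -= Lii[i][k] * A[k][j]  # unit diagonal: no division

def transform_col_block(Uii, A):
    """IP 'TransformColumnPanel' on one block: A <- A Uii^-1 (back subst.)."""
    for i in range(len(A)):
        for j in range(len(A[0])):
            for k in range(j):
                A[i][j] -= A[i][k] * Uii[k][j]
            A[i][j] /= Uii[j][j]

def gemm_update(C, A, B):
    """IP 'GEMM' specialized to C -= A * B (i.e. alpha=1, beta=-1, gamma=1)."""
    for i in range(len(C)):
        for j in range(len(C[0])):
            C[i][j] -= sum(A[i][k] * B[k][j] for k in range(len(B)))

def get_block(M, bi, bj, m):
    return [row[bj*m:(bj+1)*m] for row in M[bi*m:(bi+1)*m]]

def put_block(M, bi, bj, m, B):
    for r in range(m):
        M[bi*m + r][bj*m:(bj+1)*m] = B[r]

def blocked_lu(M, n, m):
    """Algorithm 1: factor an (n*m) x (n*m) matrix in place, block by block."""
    for i in range(n):
        D = get_block(M, i, i, m); lu_factor(D); put_block(M, i, i, m, D)  # Task 0
        for j in range(i + 1, n):                                          # Task 1
            B = get_block(M, i, j, m); transform_row_block(D, B); put_block(M, i, j, m, B)
        for j in range(i + 1, n):                                          # Task 2
            B = get_block(M, j, i, m); transform_col_block(D, B); put_block(M, j, i, m, B)
        for j in range(i + 1, n):                                          # Task 3
            for k in range(i + 1, n):
                C = get_block(M, j, k, m)
                gemm_update(C, get_block(M, j, i, m), get_block(M, i, k, m))
                put_block(M, j, k, m, C)

# Check: unpack L and U from the overwritten matrix and verify L*U == A.
A = [[4.0, 3.0, 2.0, 1.0],
     [6.0, 7.0, 4.0, 3.0],
     [2.0, 5.0, 9.0, 4.0],
     [2.0, 3.0, 5.0, 8.0]]
orig = [row[:] for row in A]
blocked_lu(A, n=2, m=2)
N = 4
L = [[A[r][c] if r > c else (1.0 if r == c else 0.0) for c in range(N)] for r in range(N)]
U = [[A[r][c] if r <= c else 0.0 for c in range(N)] for r in range(N)]
err = max(abs(sum(L[r][k] * U[k][c] for k in range(N)) - orig[r][c])
          for r in range(N) for c in range(N))
```

The blocked factorization produces, element by element, the same packed result as an unblocked Doolittle factorization of the whole matrix, which is why a single unpacking step at the end suffices for the check.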
In this way, Algorithm 1 is expressed. Lines 15-18 specify the dependences between the tasks. Lines 19-22 set up the input matrix, compile the application into a bitstream, and run it on the FPGA.

The two specifications are compiled to run on the same FPGA. The compiler will automatically generate a programming interface for the overlay specification, which is used for compiling the application specification.

A runtime system is automatically linked to the overlay by the compiler. The runtime system is composed of command queues, a task graph and a scheduler. Each IP has a command queue containing tasks to be executed. The dependences between any two tasks are represented by a task graph and managed by a scheduler dynamically. How to write such a runtime system is a known technique.

Figure 2: A design for the VGG network

A design for VGG is shown in Fig. 2. There is an overlay and an application on an FPGA. The overlay has 2 hardware IPs: Convolution and Maxpool. All convolution layers (with and without ReLU) and fully-connected (FC) layers can be computed by the Convolution IP, and all the max pooling layers can be computed by the Maxpool IP. The feature map between two layers can be communicated through external DDR, or through an on-chip feature buffer. Inside a layer, the Convolution IP has a weight buffer.

Algorithm 2 shows the two specifications for VGG, following the same principle as in the previous LU example. We leave a detailed explanation to the comments there.
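As a cross-check of the schedule that Algorithm 2 sets up, the short Python sketch below (our own illustrative code with hypothetical names, not the T2S API) enumerates how the 16 queue-0 tasks (13 convolution layers plus 3 FC layers, all on the Convolution IP) and the 5 queue-1 max-pooling tasks interleave, and which depend edges connect them.

```python
# Enumerate the VGG tasks of Algorithm 2: queue 0 serves the Convolution IP
# (convolution + FC layers), queue 1 serves the Maxpool IP. Each deps entry
# is (dependent_task, task_it_depends_on), mirroring depend(..., 0) calls.
conv_then_pool = {1: "M0", 3: "M1", 6: "M2", 9: "M3", 12: "M4"}  # conv layer -> following maxpool

queue0, queue1, deps = [], [], []
prev = None
for c in range(13):                      # the 13 convolution layers
    name = f"C{c}"
    queue0.append(name)
    if prev is not None and prev.startswith("M"):
        deps.append((name, prev))        # e.g. ConvLayers[2].depend(MaxpoolLayers[0], 0)
    if c in conv_then_pool:
        m = conv_then_pool[c]
        queue1.append(m)
        deps.append((m, name))           # e.g. MaxpoolLayers[0].depend(ConvLayers[1], 0)
        prev = m
    else:
        prev = name
for f in range(3):                       # the 3 FC layers, also on the Convolution IP
    name = f"F{f}"
    queue0.append(name)
    if f == 0:
        deps.append((name, "M4"))        # FCLayers[0].depend(MaxpoolLayers[4], 0)
```

Only cross-queue edges need explicit depend() calls; tasks within one queue already execute in order, which is why consecutive convolution layers carry no explicit dependence.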
4. Conclusion and Future Work
We have proposed an idea for a software programmer to quickly build an application-specific overlay on an FPGA, using high-level customizable IPs. We have illustrated the idea with LU decomposition and a VGG convolutional neural network. We are building a system to implement the proposed idea, leveraging our previous work on T2S. We will report the progress in future publications.

Algorithm 2: Example specifications for VGG.

/* Specification 1: Define an overlay */
Func Convolution(Place::Device), Maxpool(Place::Device);  // Two HW IPs on the device (FPGA)
ImageParam X(Float(32), 3), Y(Float(32), 3);              // Input and output feature map
ImageParam W(Float(32), 4);                               // Weights
Expr read_input_from_buffer, store_output_to_buffer,      // Control signals to reconfigure
     with_ReLU, is_FC_layer;                              // the overlay.
// Functional notations and spatial mapping for the Funcs: expressible in known
// state-of-the-art [9, 5]. Details skipped.
// Define a programming interface for each HW IP. Each IP is driven by a command queue.
// The command queues are automatically provided by the runtime.
Convolution.command(0, X, Y, W, read_input_from_buffer, store_output_to_buffer,
                    with_ReLU, is_FC_layer);
Maxpool.command(1, Y, store_output_to_buffer);
// Put the IPs into an overlay, and compile to a named bitstream.
Overlay(Convolution, Maxpool).compile("overlay.aocx");

/* Specification 2: Define an application on the overlay */
ImageParam X(Float(32), 4),   // Input feature map
           Y(Float(32), 4),   // Output feature map after the last FC layer.
           W01(Float(32), 5), W23(Float(32), 5),    // Weights for convolution layers 0-1, 2-3
           W46(Float(32), 5), W79(Float(32), 5),    // Weights for convolution layers 4-6, 7-9
           W1012(Float(32), 5),                     // Weights for convolution layers 10-12
           WFC0(Float(32), 2), WFC1(Float(32), 2),  // Weights for FC layers 0 and 1
           WFC2(Float(32), 2);                      // Weights for FC layer 2
Func ConvLayers[13](Place::Device),                 // 13 convolution layers
     FCLayers[3](Place::Device),                    // 3 fully connected layers
     MaxpoolLayers[5](Place::Device);               // 5 max pooling layers
Overlay overlay = load_overlay("overlay.aocx");  // Offload the overlay bitstream to FPGA, if
                                                 // not yet, and return a handle
// Push to queue 0 a convolution task that reads input from DDR, stores output to the feature
// buffer, with ReLU, and is not an FC layer.
ConvLayers[0](i) = overlay.enqueue(0, INPUT(X, i), DUMMY_O, WEIGHT(W01, 0), NO, YES, YES, NO);
// Next layer. Similar to the first layer, but reads from the feature buffer.
ConvLayers[1](i) = overlay.enqueue(0, DUMMY_I, DUMMY_O, WEIGHT(W01, 1), YES, YES, YES, NO);
// Push to queue 1 a Maxpool task. A Maxpool task always reads input from the feature buffer.
// Here the task stores output to the feature buffer as well.
MaxpoolLayers[0](i) = overlay.enqueue(1, DUMMY_O, YES);
ConvLayers[2](i) = overlay.enqueue(0, DUMMY_I, DUMMY_O, WEIGHT(W23, 0), YES, YES, YES, NO);
ConvLayers[3](i) = overlay.enqueue(0, DUMMY_I, DUMMY_O, WEIGHT(W23, 1), YES, YES, YES, NO);
MaxpoolLayers[1](i) = overlay.enqueue(1, DUMMY_O, YES);
ConvLayers[4](i) = overlay.enqueue(0, DUMMY_I, DUMMY_O, WEIGHT(W46, 0), YES, YES, YES, NO);
ConvLayers[5](i) = overlay.enqueue(0, DUMMY_I, DUMMY_O, WEIGHT(W46, 1), YES, YES, YES, NO);
ConvLayers[6](i) = overlay.enqueue(0, DUMMY_I, DUMMY_O, WEIGHT(W46, 2), YES, YES, YES, NO);
MaxpoolLayers[2](i) = overlay.enqueue(1, DUMMY_O, YES);
ConvLayers[7](i) = overlay.enqueue(0, DUMMY_I, DUMMY_O, WEIGHT(W79, 0), YES, YES, YES, NO);
ConvLayers[8](i) = overlay.enqueue(0, DUMMY_I, DUMMY_O, WEIGHT(W79, 1), YES, YES, YES, NO);
ConvLayers[9](i) = overlay.enqueue(0, DUMMY_I, DUMMY_O, WEIGHT(W79, 2), YES, YES, YES, NO);
MaxpoolLayers[3](i) = overlay.enqueue(1, DUMMY_O, YES);
ConvLayers[10](i) = overlay.enqueue(0, DUMMY_I, DUMMY_O, WEIGHT(W1012, 0), YES, YES, YES, NO);
ConvLayers[11](i) = overlay.enqueue(0, DUMMY_I, DUMMY_O, WEIGHT(W1012, 1), YES, YES, YES, NO);
ConvLayers[12](i) = overlay.enqueue(0, DUMMY_I, DUMMY_O, WEIGHT(W1012, 2), YES, YES, YES, NO);
MaxpoolLayers[4](i) = overlay.enqueue(1, Y, NO);  // The last Maxpool layer stores results to DDR
// An FC layer always reads input from DDR, and stores output to DDR.
FCLayers[0](i) = overlay.enqueue(0, Y, Y, WFC0, NO, NO, YES, YES);
FCLayers[1](i) = overlay.enqueue(0, Y, Y, WFC1, NO, NO, YES, YES);
FCLayers[2](i) = overlay.enqueue(0, Y, Y, WFC2, NO, NO, YES, YES);
// Specify dependences between tasks in different command queues.
// Tasks in the same queue are executed in order.
MaxpoolLayers[0].depend(ConvLayers[1], 0);  // Maxpool layer 0 depends on convolution layer 1
                                            // with distance = 0 (i.e. in the same loop iteration).
ConvLayers[2].depend(MaxpoolLayers[0], 0);
MaxpoolLayers[1].depend(ConvLayers[3], 0);
ConvLayers[4].depend(MaxpoolLayers[1], 0);
MaxpoolLayers[2].depend(ConvLayers[6], 0);
ConvLayers[7].depend(MaxpoolLayers[2], 0);
MaxpoolLayers[3].depend(ConvLayers[9], 0);
ConvLayers[10].depend(MaxpoolLayers[3], 0);
MaxpoolLayers[4].depend(ConvLayers[12], 0);
FCLayers[0].depend(MaxpoolLayers[4], 0);
// Set input, compile and run: set X, W*, and WFC* with real data, and allocate a space for Y.
Target target = get_host_target();        // Get the CPU
target.set_feature(Target::IntelFPGA);    // The CPU has an FPGA device
FCLayers[2].realize(n, target);           // Compile all the Funcs into a bitstream, offload
                                          // and run on the FPGA. Here n is
// Y contains the results of the final FC layer. The results can be post-processed on the
// host side for softmax.

References

[1] U. Aydonat, S. O'Connell, D. Capalija, A. C. Ling, and G. R. Chiu. An OpenCL deep learning accelerator on Arria 10. In Proceedings of the 2017 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, FPGA '17, pages 55-64, New York, NY, USA, 2017. Association for Computing Machinery.

[2] D. Capalija and T. S. Abdelrahman. Towards synthesis-free JIT compilation to commodity FPGAs. 2011, pages 202–205.

[3] J. Coole and G. Stitt. Fast, flexible high-level synthesis from OpenCL using reconfiguration contexts.
IEEE Micro, 34(1):42–53, 2014.

[4] J. Fowers, K. Ovtcharov, M. Papamichael, T. Massengill, M. Liu, D. Lo, S. Alkalay, M. Haselman, L. Adams, M. Ghandi, S. Heil, P. Patel, A. Sapek, G. Weisz, L. Woods, S. Lanka, S. K. Reinhardt, A. M. Caulfield, E. S. Chung, and D. Burger. A configurable cloud-scale DNN processor for real-time AI. 2018, pages 1–14.

[5] Y.-H. Lai, H. Rong, S. Zheng, W. Zhang, X. Cui, Y. Jia, J. Wang, B. Sullivan, Z. Zhang, Y. Liang, Y. Zhang, J. Cong, N. George, J. Alvarez, C. Hughes, and P. Dubey. SuSy: A programming model for productive construction of high-performance systolic arrays on FPGAs, 2020. To appear at ICCAD 2020.

[6] T. Moreau, T. Chen, L. Vega, J. Roesch, E. Yan, L. Zheng, J. Fromm, Z. Jiang, L. Ceze, C. Guestrin, and A. Krishnamurthy. A hardware-software blueprint for flexible deep learning specialization.
IEEE Micro, 39(5):8–16, 2019.

[7] H. Rong. Programmatic control of a compiler for generating high-performance spatial hardware.
CoRR, abs/1711.07606, 2017.

[8] H. K.-H. So and C. Liu. FPGA overlays. In
FPGAs for Software Programmers, chapter 16, pages 285–305. Springer, Cham, 2016. Available: https://doi.org/10.1007/978-3-319-26408-0_16.

[9] N. Srivastava, H. Rong, P. Barua, G. Feng, H. Cao, Z. Zhang, D. Albonesi, V. Sarkar, W. Chen, P. Petersen, G. Lowney, A. Herr, C. Hughes, T. Mattson, and P. Dubey. T2S-Tensor: Productively generating high-performance spatial hardware for dense tensor computation. In
Proceedings of the International Symposium on Field-Programmable Custom Computing Machines (FCCM), 2019.