Building Application-Specific Overlays on FPGAs with High-Level Customizable IPs
Hongbo Rong
Parallel Computing Lab (PCL), Intel
[email protected]
Abstract
Overlays are virtual, re-configurable architectures that overlay on top of physical FPGA fabrics [8]. An overlay that is specialized for an application, or a class of applications, offers both fast reconfiguration and a minimized performance penalty. Such an overlay is usually implemented by hardware designers in hardware "assembly" languages at register-transfer level (RTL).

This short article proposes an idea for a software programmer, instead of hardware designers, to quickly implement an application-specific overlay using high-level customizable IPs. These IPs are expressed succinctly in a specification language whose abstraction level is much higher than RTL but which can nonetheless express many performance-critical loop and data optimizations on FPGAs; the IPs thus offer competitively high performance at a much lower maintenance cost and with much easier customization.

We propose new language features to easily put the IPs together into an overlay. A compiler automatically implements the specified optimizations to generate an efficient overlay, exposes a multi-tasking programming interface for the overlay, and inserts a runtime scheduler for scheduling tasks to run on the IPs of the overlay, respecting the dependences between the tasks. While an application written in any language can take advantage of the overlay through the programming interface, we show a particular usage scenario, where the application itself is also succinctly specified in the same language.

We describe the new language features for expressing overlays, and illustrate the features with an LU decomposer and a convolutional neural network. A system is under construction to implement the language features and workloads.
1. Introduction
An FPGA has a massive number of logic elements that are distributed, locally connected and run in parallel, interleaved with memory blocks and often with hardened DSP blocks. The logic elements, interconnects, memory and DSP blocks can be synthesized to match a dataflow compute for the best performance and power efficiency. However, the synthesis time tends to be very long: even a small design may take tens of minutes, and a larger design can easily take hours or even days.

Overlays have been proposed to cut down the synthesis time. Overlays are virtual, re-configurable architectures that overlay on top of physical FPGA fabrics [8]. An overlay usually has (much) coarser granularity, and thus a (much) smaller amount, of resources that can be re-configured. Therefore, the resources of an overlay can be synthesized for a dataflow compute at a radically faster speed than with traditional hardware synthesis [8, 3, 2].

An overlay offers software programmers a software-like programming experience: an overlay is built with hardware IPs on top of an FPGA; the hardware IPs have a higher abstraction level (e.g. matrix or vector level), and thus programmers can program the overlay at that higher abstraction level instead, reaching higher productivity at a reasonable performance cost. However, there are remaining problems:

• An overlay itself is usually still implemented at RTL, and by hardware experts, with a high development cost.

• Overlays are often available only for hot domains or applications (e.g. deep learning these days [4, 6, 1]). Existing overlays might not necessarily match new algorithms, applications or domains well.

This short article proposes an idea to enable a software programmer, instead of hardware experts, to quickly build an application-specific overlay on an FPGA using high-level customizable IPs.
These IPs are succinctly specified: the dataflow of an IP is expressed in a functional notation, followed by a description of how to efficiently map the dataflow onto the spatial FPGA architecture with many loop and data optimizations, e.g. how to map the dataflow onto a systolic array that matches the underlying FPGA architecture well and thus is critical for performance.

The IPs are only specified, while the detailed implementation of the specified optimizations is left to a compiler. The specification language and compiler used are T2S (Temporal To Spatial) [7]. Our previous work on T2S has proved that a smart compiler can generate efficient IPs in a fraction of the development time but with competitive performance, compared with the same IPs optimized with the same set of optimizations, where the optimizations are implemented manually by experts in high-level synthesis (HLS) languages [9, 5].

Since the IPs are written at an abstraction level much higher than RTL, the IPs require a much lower maintenance cost, are much easier to customize by software programmers, and, on the other hand, with the right set of optimizations, can offer competitively high performance.

The compiler will automatically expose a multi-tasking programming interface for an overlay, and insert a runtime scheduler for scheduling tasks to run on the overlay, respecting the dependences between the tasks. While an application written in any language can take advantage of the overlays through the programming interfaces, we show a particular usage scenario, where the application itself is also succinctly specified in the same language.

This approach is generally applicable to many applications that have many tasks, where the tasks need to share limited FPGA resources. We will illustrate the approach with a VGG convolutional neural network and a blocked LU decomposer. We will define an overlay for each of them; each overlay contains a few IPs on an FPGA.
For the neural network, we map and schedule all the layers to an overlay. For the blocked LU decomposer, we dynamically generate tasks, and schedule them to the other overlay.

This article focuses on describing the idea. We are building a prototype to implement the proposed idea, leveraging our current systems [9, 5]. We will report the progress in future publications.
2. Overall Flow
Fig. 1 shows the overall flow. A programmer specifies a definition of an overlay. Directed by the specification, a compiler automatically links the overlay definition with a pre-written runtime system, synthesizes them into a bitstream for an FPGA, and generates a programming interface for the overlay. The runtime system includes command queues, a task graph and a scheduler.

The overlay generated on the FPGA can be invoked by an application written in any language by calling the programming interface. A particularly interesting scenario is that the application is also written in the same specification language. In Fig. 1, we show that a programmer specifies an application to run on the overlay. The compiler synthesizes the application with the programming interface into another bitstream.

Then the compiler offloads both the overlay and the application to an FPGA. When the programmer invokes the application to run, the application automatically generates tasks for the runtime system to schedule to run on the overlay.

Footnote: We believe that if the compiler is engineered right, the performance of an IP will be mainly determined by the set of optimizations used for the IP, not by whether the optimizations are implemented automatically or manually. This belief has been supported by our current prototypes. Our current prototypes generate HLS code only, and thus we compare only with expert-written HLS code. However, there is no restriction preventing our approach from generating RTL code, which is purely an engineering effort. When generating RTL code, we believe the same phenomenon will repeat: IPs specified in our language and implemented in detail by the automatic compiler should exhibit competitive performance vs. expert-written RTL code with the same set of optimizations. We will verify this belief in the future.
The example application shown in the figure is an LU decomposer, which has many tasks of 4 kinds generated during the execution, dispatched by the runtime to run on the 4 corresponding hardware IPs in the overlay. We will describe this example in more detail below.
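The runtime pieces named above, one command queue per IP, a task graph, and a dynamic scheduler, can be sketched in a few lines. The following Python sketch is illustrative only (the names `Task` and `schedule` are ours, not part of T2S): it simulates the four kinds of LU tasks being dispatched to their IPs while respecting dependences.

```python
# An illustrative model (our own names, not the actual T2S runtime) of the
# runtime: one command queue per IP, a task graph encoded as per-task
# dependence sets, and a scheduler that lets an IP start the task at the head
# of its queue only once all of that task's dependences have completed.
from collections import deque

class Task:
    def __init__(self, name, ip, deps=()):
        self.name, self.ip, self.deps = name, ip, set(deps)

def schedule(tasks):
    """Simulate dispatching tasks to per-IP command queues, respecting
    dependences. Returns the order in which tasks complete."""
    queues = {}
    for t in tasks:
        queues.setdefault(t.ip, deque()).append(t)  # in-queue order preserved
    done, order = set(), []
    while any(queues.values()):
        progressed = False
        for q in queues.values():
            if q and q[0].deps <= done:  # head task's dependences satisfied
                t = q.popleft()
                done.add(t.name)
                order.append(t.name)
                progressed = True
        if not progressed:
            raise RuntimeError("cyclic or unsatisfiable dependences")
    return order

# One iteration of the blocked LU example: the row/column-panel transforms
# depend on the diagonal decomposition, and the GEMM update depends on both.
order = schedule([Task("T0", "LU"),
                  Task("T1", "TransformRowPanel", deps={"T0"}),
                  Task("T2", "TransformColumnPanel", deps={"T0"}),
                  Task("T3", "GEMM", deps={"T1", "T2"})])
```

Running this yields a completion order in which T0 precedes T1 and T2, and both precede T3; a real runtime would run the IPs concurrently, but the dependence test at the queue heads is the same.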
3. Examples
In this section, we illustrate our idea with an LU decomposer and a VGG convolutional neural network. Instead of using formal definitions, we will intuitively and effectively explain our language features through these examples.
For a matrix

A = \begin{pmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{pmatrix},

we would like to decompose it into

A = LU = \begin{pmatrix} L_{11} & \\ L_{21} & L_{22} \end{pmatrix} \begin{pmatrix} U_{11} & U_{12} \\ & U_{22} \end{pmatrix}.

Therefore, it is easy to see that

A_{11} = L_{11} U_{11}   (1)
A_{12} = L_{11} U_{12}   (2)
A_{21} = L_{21} U_{11}   (3)
A_{22} = L_{21} U_{12} + L_{22} U_{22}   (4)

Therefore,

A_{11} = L_{11} U_{11}   (5)
U_{12} = L_{11}^{-1} A_{12}   (6)
L_{21} = A_{21} U_{11}^{-1}   (7)
L_{22} U_{22} = A_{22} - L_{21} U_{12}   (8)

We can generalize this example. Suppose the original square matrix A is divided into n * n square blocks, each block having m * m elements. The algorithm of blocked LU is shown in Algorithm 1. We envision that a T2S specification can be written as shown in Fig. 1. There are 4 hardware IPs:

Figure 1: The overall flow

Algorithm 1:
The blocked LU decomposition algorithm.

for (i = 0; i < n; i++)
    Task 0: decompose A_{ii} = L_{ii} U_{ii}
    Task 1: calculate U_{i,(i+1):n} = L_{ii}^{-1} A_{i,(i+1):n}
    Task 2: calculate L_{(i+1):n,i} = A_{(i+1):n,i} U_{ii}^{-1}
    Task 3: calculate A_{(i+1):n,(i+1):n} -= L_{(i+1):n,i} U_{i,(i+1):n}

• LU, which accepts a square block A with the size of m * m, decomposes it into matrices L and U, and stores them in the same space as A. Note that the diagonal of L contains only 1's, and thus is not stored.

• TransformRowPanel, which accepts a row panel with a number of blocks, each block with the size of m * m, and uses the first block (corresponding to L_{ii}) to transform the other blocks, i.e. L_{ii}^{-1} A_{i,(i+1):n}.

• TransformColumnPanel, which accepts a column panel with a number of blocks, each block with the size of m * m, and uses the first block (corresponding to U_{ii}) to transform the other blocks, i.e. A_{(i+1):n,i} U_{ii}^{-1}.

• GEMM, which accepts a matrix
C and matrices A, B, along with coefficients α, β, γ, and computes C = αC + βA * γB.

All the 4 IPs do in-place updates: they write their outputs into the same space as their inputs. In Fig. 1, the specifications use several features new to the T2S language:

• The Overlay type is a container for the IPs and the runtime system.

• F.command(queueNo, parameters) specifies a programming interface for Func F: the command queue and the parameters.

• O.enqueue(queueNo, parameters) enqueues a command, with the given parameters, to the given command queue of the overlay O.

• T1.depend(T2, d, [condition]) says that, under an optional condition, task T1 in the current iteration depends on task T2 from d iterations before.

• A.BCropped(m, startRow, endRow, startCol, endCol) means to crop, in blocks of m * m, from a buffer A, from the given start row to end row (included), and from the given start column to end column (included). The cropping is in-place, and thus the cropped buffer shares the space with the original buffer.

We can explain Fig. 1 in more detail. A software programmer writes two specifications, one for the overlay, and the other for the application (i.e. the LU decomposer).

In the specification of the overlay, Lines 1-3 declare the 4 IPs on an (FPGA) device. Lines 4-7 declare the inputs of the IPs. Line 8 defines the IPs. We assume that the IPs have already been specified with the necessary optimizations in the T2S language by experts, and are provided to the programmer as a library of building blocks. Therefore, we skip the details of the definitions of the IPs here. Lines 9-12 define a programming interface for each IP. Each is driven by a command queue, which is automatically provided by a runtime system. Finally, Lines 13-14 put the IPs into an overlay, and compile the overlay to a named bitstream.

In the specification of the application, Lines 1-2 declare the matrix to be decomposed, and the 4 kinds of tasks corresponding to the 4 IPs. Lines 3-9 define some macros that are only for convenience of use later. Line 10 offloads the overlay's bitstream to an FPGA, if not done yet, and returns a handle. Lines 11-14 generate the 4 tasks and enqueue them into the command queues of the corresponding IPs. Note that there is an implicit loop i around the tasks.
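To make the four IPs and their in-place semantics concrete, here is a small pure-Python sketch (our own illustrative code, not a T2S specification; all function names are ours) that models Algorithm 1 on a 4x4 matrix split into 2x2 blocks, and then verifies that the packed L and U factors reproduce the original matrix.

```python
# A pure-Python model of Algorithm 1 (blocked LU without pivoting). Each of
# the four IPs becomes a function with the in-place semantics described in
# the text: L's unit diagonal is not stored, and L/U overwrite A's space.

def lu_factor(B):
    """IP 'LU': in-place Doolittle factorization of an m x m block. After the
    call, B holds U on/above the diagonal and L strictly below it."""
    m = len(B)
    for k in range(m):
        for i in range(k + 1, m):
            B[i][k] /= B[k][k]
            for j in range(k + 1, m):
                B[i][j] -= B[i][k] * B[k][j]

def transform_row_block(Lii, A):
    """IP 'TransformRowPanel' on one block: A <- Lii^-1 A (forward subst.)."""
    for j in range(len(A[0])):
        for i in range(len(A)):
            for k in range(i):
                A[i][j] -= Lii[i][k] * A[k][j]  # unit diagonal: no division

def transform_col_block(Uii, A):
    """IP 'TransformColumnPanel' on one block: A <- A Uii^-1 (back subst.)."""
    for i in range(len(A)):
        for j in range(len(A[0])):
            for k in range(j):
                A[i][j] -= A[i][k] * Uii[k][j]
            A[i][j] /= Uii[j][j]

def gemm_update(C, A, B):
    """IP 'GEMM' specialized to C -= A * B (i.e. alpha=1, beta=-1, gamma=1)."""
    for i in range(len(C)):
        for j in range(len(C[0])):
            C[i][j] -= sum(A[i][k] * B[k][j] for k in range(len(B)))

def get_block(M, bi, bj, m):
    return [row[bj*m:(bj+1)*m] for row in M[bi*m:(bi+1)*m]]

def put_block(M, bi, bj, m, B):
    for r in range(m):
        M[bi*m + r][bj*m:(bj+1)*m] = B[r]

def blocked_lu(M, n, m):
    """Algorithm 1: factor an (n*m) x (n*m) matrix in place, block by block."""
    for i in range(n):
        D = get_block(M, i, i, m); lu_factor(D); put_block(M, i, i, m, D)  # Task 0
        for j in range(i + 1, n):                                          # Task 1
            B = get_block(M, i, j, m); transform_row_block(D, B); put_block(M, i, j, m, B)
        for j in range(i + 1, n):                                          # Task 2
            B = get_block(M, j, i, m); transform_col_block(D, B); put_block(M, j, i, m, B)
        for j in range(i + 1, n):                                          # Task 3
            for k in range(i + 1, n):
                C = get_block(M, j, k, m)
                gemm_update(C, get_block(M, j, i, m), get_block(M, i, k, m))
                put_block(M, j, k, m, C)

# Check: unpack L and U from the overwritten matrix and verify L*U == A.
A = [[4.0, 3.0, 2.0, 1.0],
     [6.0, 7.0, 4.0, 3.0],
     [2.0, 5.0, 9.0, 4.0],
     [2.0, 3.0, 5.0, 8.0]]
orig = [row[:] for row in A]
blocked_lu(A, n=2, m=2)
N = 4
L = [[A[r][c] if r > c else (1.0 if r == c else 0.0) for c in range(N)] for r in range(N)]
U = [[A[r][c] if r <= c else 0.0 for c in range(N)] for r in range(N)]
err = max(abs(sum(L[r][k] * U[k][c] for k in range(N)) - orig[r][c])
          for r in range(N) for c in range(N))
```

The blocked factorization produces, element by element, the same packed result as an unblocked Doolittle factorization of the whole matrix, which is why a single unpacking step at the end suffices for the check.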
In this way, Algorithm 1 is expressed. Lines 15-18 specify the dependences between the tasks. Lines 19-22 set up the input matrix, compile the application into a bitstream, and run it on the FPGA.

The two specifications are compiled to run on the same FPGA. The compiler will automatically generate a programming interface for the overlay specification, which is used for compiling the application specification.

A runtime system is automatically linked to the overlay by the compiler. The runtime system is composed of command queues, a task graph and a scheduler. Each IP has a command queue containing tasks to be executed. The dependences between any two tasks are represented by a task graph and managed by a scheduler dynamically. How to write such a runtime system is a known technique.

Figure 2: A design for the VGG network

A design for VGG is shown in Fig. 2. There is an overlay and an application on an FPGA. The overlay has 2 hardware IPs: Convolution and Maxpool. All convolution layers (with and without ReLU) and fully-connected (FC) layers can be computed by the Convolution IP, and all the max pooling layers can be computed by the Maxpool IP. The feature map between two layers can be communicated through external DDR, or through an on-chip feature buffer. Inside a layer, the Convolution IP has a weight buffer.

Algorithm 2 shows the two specifications for VGG, following the same principle as in the previous LU example. We leave a detailed explanation to the comments there.
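As a cross-check of the schedule that Algorithm 2 sets up, the short Python sketch below (our own illustrative code with hypothetical names, not the T2S API) enumerates how the 16 queue-0 tasks (13 convolution layers plus 3 FC layers, all on the Convolution IP) and the 5 queue-1 max-pooling tasks interleave, and which depend edges connect them.

```python
# Enumerate the VGG tasks of Algorithm 2: queue 0 serves the Convolution IP
# (convolution + FC layers), queue 1 serves the Maxpool IP. Each deps entry
# is (dependent_task, task_it_depends_on), mirroring depend(..., 0) calls.
conv_then_pool = {1: "M0", 3: "M1", 6: "M2", 9: "M3", 12: "M4"}  # conv layer -> following maxpool

queue0, queue1, deps = [], [], []
prev = None
for c in range(13):                      # the 13 convolution layers
    name = f"C{c}"
    queue0.append(name)
    if prev is not None and prev.startswith("M"):
        deps.append((name, prev))        # e.g. ConvLayers[2].depend(MaxpoolLayers[0], 0)
    if c in conv_then_pool:
        m = conv_then_pool[c]
        queue1.append(m)
        deps.append((m, name))           # e.g. MaxpoolLayers[0].depend(ConvLayers[1], 0)
        prev = m
    else:
        prev = name
for f in range(3):                       # the 3 FC layers, also on the Convolution IP
    name = f"F{f}"
    queue0.append(name)
    if f == 0:
        deps.append((name, "M4"))        # FCLayers[0].depend(MaxpoolLayers[4], 0)
```

Only cross-queue edges need explicit depend() calls; tasks within one queue already execute in order, which is why consecutive convolution layers carry no explicit dependence.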
4. Conclusion and Future Work
We have proposed an idea for a software programmer to quickly build an application-specific overlay on an FPGA, using high-level customizable IPs. We have illustrated the idea with LU decomposition and a VGG convolutional neural network. We are building a system to implement the proposed idea, leveraging our previous work on T2S. We will report the progress in future publications.

Algorithm 2: Example specifications for VGG.

/* Specification 1: Define an overlay */
Func Convolution(Place::Device), Maxpool(Place::Device);  // Two HW IPs on the device (FPGA)
ImageParam X(Float(32), 3), Y(Float(32), 3);              // Input and output feature map
ImageParam W(Float(32), 4);                               // Weights
Expr read_input_from_buffer, store_output_to_buffer,      // Control signals to reconfigure
     with_ReLU, is_FC_layer;                              // the overlay.
// Functional notations and spatial mapping for the Funcs: expressible in known
// state-of-the-art [9, 5]. Details skipped.
// Define a programming interface for each HW IP. Each IP is driven by a command queue.
// The command queues are automatically provided by the runtime.
Convolution.command(0, X, Y, W, read_input_from_buffer, store_output_to_buffer,
                    with_ReLU, is_FC_layer);
Maxpool.command(1, Y, store_output_to_buffer);
// Put the IPs into an overlay, and compile to a named bitstream.
Overlay(Convolution, Maxpool).compile("overlay.aocx");

/* Specification 2: Define an application on the overlay */
ImageParam X(Float(32), 4),   // Input feature map
           Y(Float(32), 4),   // Output feature map after the last FC layer.
           W01(Float(32), 5), W23(Float(32), 5),    // Weights for convolution layers 0-1, 2-3
           W46(Float(32), 5), W79(Float(32), 5),    // Weights for convolution layers 4-6, 7-9
           W1012(Float(32), 5),                     // Weights for convolution layers 10-12
           WFC0(Float(32), 2), WFC1(Float(32), 2),  // Weights for FC layers 0 and 1
           WFC2(Float(32), 2);                      // Weights for FC layer 2
Func ConvLayers[13](Place::Device),                 // 13 convolution layers
     FCLayers[3](Place::Device),                    // 3 fully connected layers
     MaxpoolLayers[5](Place::Device);               // 5 max pooling layers
Overlay overlay = load_overlay("overlay.aocx");  // Offload the overlay bitstream to FPGA, if
                                                 // not yet, and return a handle
// Push to queue 0 a convolution task that reads input from DDR, stores output to the feature
// buffer, with ReLU, and is not an FC layer.
ConvLayers[0](i) = overlay.enqueue(0, INPUT(X, i), DUMMY_O, WEIGHT(W01, 0), NO, YES, YES, NO);
// Next layer. Similar to the first layer, but reads from the feature buffer.
ConvLayers[1](i) = overlay.enqueue(0, DUMMY_I, DUMMY_O, WEIGHT(W01, 1), YES, YES, YES, NO);
// Push to queue 1 a Maxpool task. A Maxpool task always reads input from the feature buffer.
// Here the task stores output to the feature buffer as well.
MaxpoolLayers[0](i) = overlay.enqueue(1, DUMMY_O, YES);
ConvLayers[2](i) = overlay.enqueue(0, DUMMY_I, DUMMY_O, WEIGHT(W23, 0), YES, YES, YES, NO);
ConvLayers[3](i) = overlay.enqueue(0, DUMMY_I, DUMMY_O, WEIGHT(W23, 1), YES, YES, YES, NO);
MaxpoolLayers[1](i) = overlay.enqueue(1, DUMMY_O, YES);
ConvLayers[4](i) = overlay.enqueue(0, DUMMY_I, DUMMY_O, WEIGHT(W46, 0), YES, YES, YES, NO);
ConvLayers[5](i) = overlay.enqueue(0, DUMMY_I, DUMMY_O, WEIGHT(W46, 1), YES, YES, YES, NO);
ConvLayers[6](i) = overlay.enqueue(0, DUMMY_I, DUMMY_O, WEIGHT(W46, 2), YES, YES, YES, NO);
MaxpoolLayers[2](i) = overlay.enqueue(1, DUMMY_O, YES);
ConvLayers[7](i) = overlay.enqueue(0, DUMMY_I, DUMMY_O, WEIGHT(W79, 0), YES, YES, YES, NO);
ConvLayers[8](i) = overlay.enqueue(0, DUMMY_I, DUMMY_O, WEIGHT(W79, 1), YES, YES, YES, NO);
ConvLayers[9](i) = overlay.enqueue(0, DUMMY_I, DUMMY_O, WEIGHT(W79, 2), YES, YES, YES, NO);
MaxpoolLayers[3](i) = overlay.enqueue(1, DUMMY_O, YES);
ConvLayers[10](i) = overlay.enqueue(0, DUMMY_I, DUMMY_O, WEIGHT(W1012, 0), YES, YES, YES, NO);
ConvLayers[11](i) = overlay.enqueue(0, DUMMY_I, DUMMY_O, WEIGHT(W1012, 1), YES, YES, YES, NO);
ConvLayers[12](i) = overlay.enqueue(0, DUMMY_I, DUMMY_O, WEIGHT(W1012, 2), YES, YES, YES, NO);
MaxpoolLayers[4](i) = overlay.enqueue(1, Y, NO);  // The last Maxpool layer stores results to DDR
// An FC layer always reads input from DDR, and stores output to DDR.
FCLayers[0](i) = overlay.enqueue(0, Y, Y, WFC0, NO, NO, YES, YES);
FCLayers[1](i) = overlay.enqueue(0, Y, Y, WFC1, NO, NO, YES, YES);
FCLayers[2](i) = overlay.enqueue(0, Y, Y, WFC2, NO, NO, YES, YES);
// Specify dependences between tasks in different command queues.
// Tasks in the same queue are executed in order.
MaxpoolLayers[0].depend(ConvLayers[1], 0);  // Maxpool layer 0 depends on convolution layer 1
                                            // with distance = 0 (i.e. in the same loop iteration).
ConvLayers[2].depend(MaxpoolLayers[0], 0);
MaxpoolLayers[1].depend(ConvLayers[3], 0);
ConvLayers[4].depend(MaxpoolLayers[1], 0);
MaxpoolLayers[2].depend(ConvLayers[6], 0);
ConvLayers[7].depend(MaxpoolLayers[2], 0);
MaxpoolLayers[3].depend(ConvLayers[9], 0);
ConvLayers[10].depend(MaxpoolLayers[3], 0);
MaxpoolLayers[4].depend(ConvLayers[12], 0);
FCLayers[0].depend(MaxpoolLayers[4], 0);
// Set input, compile and run: set X, W*, and WFC* with real data, and allocate a space for Y.
Target target = get_host_target();        // Get the CPU
target.set_feature(Target::IntelFPGA);    // The CPU has an FPGA device
FCLayers[2].realize(n, target);           // Compile all the Funcs into a bitstream, offload
                                          // and run on the FPGA. Here n is
// Y contains the results of the final FC layer. The results can be post-processed on the
// host side for softmax.

References

[1] U. Aydonat, S. O'Connell, D. Capalija, A. C. Ling, and G. R. Chiu. An OpenCL deep learning accelerator on Arria 10. In Proceedings of the 2017 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, FPGA '17, pages 55-64, New York, NY, USA, 2017. Association for Computing Machinery.

[2] D. Capalija and T. S. Abdelrahman. Towards synthesis-free JIT compilation to commodity FPGAs. 2011, pages 202–205.

[3] J. Coole and G. Stitt. Fast, flexible high-level synthesis from OpenCL using reconfiguration contexts.
IEEE Micro, 34(1):42–53, 2014.

[4] J. Fowers, K. Ovtcharov, M. Papamichael, T. Massengill, M. Liu, D. Lo, S. Alkalay, M. Haselman, L. Adams, M. Ghandi, S. Heil, P. Patel, A. Sapek, G. Weisz, L. Woods, S. Lanka, S. K. Reinhardt, A. M. Caulfield, E. S. Chung, and D. Burger. A configurable cloud-scale DNN processor for real-time AI. 2018, pages 1–14.

[5] Y.-H. Lai, H. Rong, S. Zheng, W. Zhang, X. Cui, Y. Jia, J. Wang, B. Sullivan, Z. Zhang, Y. Liang, Y. Zhang, J. Cong, N. George, J. Alvarez, C. Hughes, and P. Dubey. SuSy: A programming model for productive construction of high-performance systolic arrays on FPGAs, 2020. To appear at ICCAD 2020.

[6] T. Moreau, T. Chen, L. Vega, J. Roesch, E. Yan, L. Zheng, J. Fromm, Z. Jiang, L. Ceze, C. Guestrin, and A. Krishnamurthy. A hardware-software blueprint for flexible deep learning specialization.
IEEE Micro, 39(5):8–16, 2019.

[7] H. Rong. Programmatic control of a compiler for generating high-performance spatial hardware.
CoRR, abs/1711.07606, 2017.

[8] H. K.-H. So and C. Liu. FPGA overlays. In
FPGAs for Software Programmers, chapter 16, pages 285–305. Springer, Cham, 2016. Available: https://doi.org/10.1007/978-3-319-26408-0_16.

[9] N. Srivastava, H. Rong, P. Barua, G. Feng, H. Cao, Z. Zhang, D. Albonesi, V. Sarkar, W. Chen, P. Petersen, G. Lowney, A. Herr, C. Hughes, T. Mattson, and P. Dubey. T2S-Tensor: Productively generating high-performance spatial hardware for dense tensor computation. In
Proceedings of the International Symposium on Field-Programmable Custom Computing Machines (FCCM), 2019.