DB4HLS: A Database of High-Level Synthesis Design Space Explorations
Lorenzo Ferretti, Jihye Kwon, Giovanni Ansaloni, Giuseppe Di Guglielmo, Luca Carloni, Laura Pozzi
Università della Svizzera italiana, Lugano, Switzerland; Columbia University, New York, United States; EPFL, Lausanne, Switzerland
Abstract: High-Level Synthesis (HLS) frameworks make it easy to specify a large number of variants of the same hardware design by acting only on optimization directives. Nonetheless, synthesizing implementations for all possible combinations of directive values is impractical even for simple designs. To address this shortcoming, many HLS Design Space Exploration (DSE) strategies have been proposed to devise directive settings leading to high-quality implementations while limiting the number of synthesis runs. All these works require considerable effort to validate the proposed strategies and/or to build the knowledge base employed to tune abstract models, as both tasks mandate the synthesis of large collections of implementations. Currently, such data gathering is performed ad hoc, a) leading to a lack of standardization, which hampers comparisons between DSE alternatives, and b) placing a very high burden on researchers willing to develop novel DSE strategies. Against this backdrop, we here introduce DB4HLS, a database of exhaustive HLS explorations comprising more than 100000 design points collected over 4 years of synthesis time. The open structure of DB4HLS allows the incremental integration of new DSEs, which can be easily defined with a dedicated domain-specific language. We think that our database, available at , will be a valuable tool for the research community investigating automated strategies for the optimization of HLS-based hardware designs.
Index Terms: High-Level Synthesis, Databases, Machine Learning, Big Data, Design Space Exploration.
I. INTRODUCTION
High-Level Synthesis (HLS) fostered a revolution in hardware design. HLS frameworks allow the specification of hardware components in languages such as C, C++, or SystemC. As opposed to traditional Register Transfer Level (RTL) approaches, HLS flows do not require detailed descriptions of the logic gates, memory elements and interconnects comprising hardware implementations. Instead, these are automatically generated, based on the high-level specifications and on a set of directive values specifying optimizations such as the unrolling factor of loops and the inlining of functions. By decoupling specification from implementation, HLS allows unprecedented productivity, leading to considerable reductions in non-recurring engineering costs.

Nonetheless, while HLS makes it easy to define vast design spaces for a given hardware specification, determining the performance (latency) and resource requirements (area, power) of each implementation still requires time-consuming syntheses. The number of possible implementations of a design grows exponentially with the number of applied directives, while, in general, only a few of them are Pareto-optimal from a performance/resources perspective. Exhaustive explorations are therefore wasteful (since only Pareto implementations are of interest) and impractical beyond very simple cases.

Various strategies, which we summarize in Section II, have been proposed to identify (or approximate) the set of Pareto implementations while minimising the number of synthesis runs [1] [2] [3]. This problem is named HLS-driven Design Space Exploration (DSE). The proposed DSE strategies are typically validated against exhaustive explorations, which the authors performed ad hoc. Moreover, works such as [4] [5] rely on prior knowledge to steer the HLS exploration process. Performing the huge number of syntheses required for validation or for generating a high-quality knowledge base entails a very high effort, which at present must be repeated from scratch whenever the performance of a novel DSE methodology is investigated.

Against this backdrop, we introduce DB4HLS, a database of high-level synthesis design space explorations. The database comprises more than 100000 design points, reporting the synthesis outcomes of exhaustive explorations performed on 39 designs from the MachSuite [6] benchmark suite. In addition, we provide a simple domain-specific language to define design spaces, resulting in an open infrastructure that can be enriched by further contributions from the research community.

We believe that, by providing standardized synthesis data sets, our effort will allow easier comparisons among DSE strategies, enabling fairer evaluations of the strengths and weaknesses of each approach. It will also facilitate the development and assessment of future design exploration frameworks, spurring research in this challenging field.

II. RELATED WORKS
State-of-the-art DSE frameworks for HLS follow three main approaches. Black-box methodologies aim, after an initial phase, at iteratively refining explorations by smartly selecting additional design points. To this end, they employ unsupervised learning strategies such as clustering [2], random forest [1], lattice traversing [3] and response surface models [7]. Model-based strategies, on the other hand, estimate performance and resource requirements of implementations by developing an analytical formulation of the effect of directives when applied to a design. Typically, they can approximate well the Pareto set of best-performing implementations with few syntheses, but are restricted in the type of targeted optimizations (e.g., loop unrolling and dataflow in [8]). The authors of all these works adopt as figure of merit either the Hypervolume or the Average Distance from Reference Set (ADRS) for validation, and both require the computation of true Pareto frontiers from exhaustive explorations. Recently, a promising research avenue has focused, instead, on exploiting prior knowledge in order to perform Design Space Exploration in hardware design. These works [4] [5] leverage the availability of a comprehensive knowledge base, such as the one we describe in our paper, to achieve exploration results close to those of model-based strategies while being much more flexible in the number and type of supported directives.

TABLE I: DSEs available in the database. Each entry reports benchmark, function name, and number of configurations (|CS|). All functions are from MachSuite [6].

Benchmark            Function name                     |CS|
spmv_ellpack         ellpack                           1600
bfs_bulk             bulk                              2352
md_knn               md_kernel                         1600
viterbi              viterbi                           1152
gemm_ncubed          gemm                              2744
gemm_blocked         bbgemm                            1600
fft_strided          fft                               64
sort_merge           ms_mergesort                      4096
                     merge                             4096
stencil_stencil2d    stencil                           1344
stencil_stencil3d    stencil3d                         1536
radix_sort           update                            2400
                     hist                              4704
                     init                              484
                     sum_scan                          1280
                     last_step_scan                    800
                     local_scan                        704
                     ss_sort                           1792
aes                  aes_addRoundKey                   500
                     aes_subBytes                      50
                     aes_addRoundKey_cpy               625
                     aes_shiftRows                     20
                     aes_mixColumns                    18
                     aes_expandEncKey                  216
                     aes256_encrypt_ecb                1944
backprop             get_delta_matrix_weights1         21952
                     get_delta_matrix_weights2         31213
                     get_delta_matrix_weights3         21952
                     get_oracle_activations1           2401
                     get_oracle_activations2           1372
                     product_with_bias_input_layer     1372
                     product_with_bias_second_layer    686
                     product_with_bias_output_layer    392
                     backprop                          2048
                     add_bias_to_activations           1372
                     soft_max                          64
                     take_difference                   512
                     update_weights                    1024

While benchmark suites dedicated to hardware design are available, such as CHStone [9], MachSuite [6], Rosetta [10] and S2CBench [11], they only provide specifications (in the form of C/C++ code) as benchmarks.
Conversely, our DB4HLS suite offers rich and well-defined design spaces and related synthesis outcomes, greatly easing the burden of performing comparative evaluations of exploration methodologies. To the best of our knowledge, this is the first database of HLS implementations made publicly available with the intent of standardizing the evaluation process and providing a source of knowledge for ML strategies.

Fig. 1: Simplified scheme of the Entity-Relationship Diagram (ERD) of the DB4HLS syntheses database. Entities: benchmark, algorithm, design, configuration space, configuration, implementation, synthesis information, resource results, performance results; relationships are many-to-one or one-to-one.
III. AVAILABLE DESIGN SPACE EXPLORATIONS
We provide a rich set of DSEs by targeting the benchmarks of the MachSuite collection of designs [6]. We have performed DSEs for 39 out of 50 functions in the benchmark suite, discarding those having a variable latency due to input-dependent control flows, and those having very small design spaces. The considered functions have on average 40 lines of code, with the biggest having 308 lines of code.

We performed an exhaustive exploration of each design, according to the configuration space defined by the user, running more than 100000 syntheses. Table I lists all the designs explored and their configuration space size.

We used Vivado HLS [12] version 2018.2 to perform the syntheses, and we targeted a ZynqMP Ultrascale+ (xczu9eg) FPGA chip, with a target clock of ns.

To restrain the size of the design spaces, we constrained directives with a numerical range of values (e.g., the unrolling factor) to powers of two or integer divisors of the maximum admissible value (e.g., the number of loop iterations). Moreover, for some designs, different optimizations are forced to have the same values when intuitively such a choice would lead to better cost/performance trade-offs (e.g., binding the loop unrolling factor to the array partitioning one).

Even when considering these constraints, the data collection required more than 4 years of single-core machine time. To speed up this process, GNU Parallel was adopted to collect synthesis results from 60 parallel Vivado HLS instances, allowing us to populate the database in approximately 25 days of wall-clock time.

IV. DB4HLS INFRASTRUCTURE
In addition to the DSE data, the DB4HLS framework offers a) a database infrastructure hosting DSEs in a structured and easy-to-access way, b) a domain-specific language used to describe a configuration space for a target design, and c) an interface to generate new explorations and further enrich the database. The remainder of this section describes these contributions in detail.
A. A database for DSEs
Snippet 1: last_step_scan design (C code).

    void last_step_scan(int bucket[SIZE], int sum[RADIX]) {
        int i, j, k;
        loop_1: for (i = 0; i < RADIX; i++) {
            for (j = 0; j < BLOCK; j++) {
            }
        }
    }

Snippet 2: Configuration Space of last_step_scan.

    1  {RAM_2P_BRAM}
    2  {RAM_2P_BRAM}
    3  {cyclic,block};{ > }
    4  {cyclic,block};{ > } @bind_a
    5  unroll;last_step_scan;last_1;{ > } @bind_a
    6  unroll;last_step_scan;last_2;{ }
    7  { }

Fig. 2: Left: Snippet of the last_step_scan C code function from MachSuite [6]. We rewrote the code to increase readability without affecting its functionality. Right: An associated Configuration Space Descriptor (CSD).
The database structure, implemented in MySQL, comprises a description of the design targeted for exploration (top part of Figure 1), and that of the explored HLS optimizations applied to each design (middle part of Figure 1). Finally, it reports the resource and performance results obtained by synthesis (bottom part of Figure 1). Each of these components is described in more detail in the following.

Similarly to the taxonomy adopted in MachSuite [6], applications are identified by the benchmark they belong to (e.g., aes), by the algorithm they realize (e.g., aes256_encrypt) and by the design implementing such an algorithm. As an example, two variants are provided by MachSuite for the aes256_encrypt algorithm (one using lookup tables to store encryption keys and one generating the values online), each corresponding to a separate design specified as C++ code.

Descriptors of the HLS optimizations considered for the DSEs are stored as entries in the configuration space table. Multiple explorations (hence, rows in the configuration space table) for the same design are possible, corresponding to different choices of optimizations, explorations targeting different tools/FPGAs, or even contributions from different researchers. An entry in the configuration space table is linked to many entries of the configuration table, where each entry indicates a specific element of the design space.

A row in the configuration table (which indicates the set of HLS optimizations defining a design space element) is linked to an entry in the implementation table. Furthermore, the synthesis information table provides additional information on each performed synthesis: the synthesis timestamp, the contributor that originated the data, the employed synthesis tool and version, and the targeted FPGA. Finally, each implementation links to one or more entries in the resources and performance tables, which report the synthesis outcomes. Resources are expressed as employed Flip-Flops, Look-Up Tables, Block RAMs (BRAM) and DSP blocks, while performance is reported in terms of effective latency.
B. A domain-specific language for DSEs
Generating the different configurations associated with a DSE is a tedious and error-prone process when performed by hand. We therefore developed a Domain-Specific Language (DSL) to automatically and concisely define configuration spaces by employing Configuration Space Descriptors (CSDs). Each line of a descriptor encodes a knob, which comprises a directive type, a label corresponding to its location in the design C/C++ code, and one or multiple sets of values. The number of sets is equal to the number of parameters required by the directive type. Values can be numerical when expressing optimizations such as loop unrolling or array partitioning factors, or categorical when determining the type of employed FPGA resources such as BRAM types. A shorthand is provided for expressing regular value series (e.g., a succession of power-of-two values). Finally, we provide a @bind decorator, which constrains the values associated with different directives.

Figure 2 shows, for the last_step_scan function in Snippet 1, an example of a descriptor created with the DSL to define its configuration space (Snippet 2). The descriptor defines seven different knobs. Lines 1 and 2 of Snippet 2 show two knobs associating a dual-port BRAM with the input arrays bucket and sum, respectively. Lines 3 and 4 define knobs specifying the array partitioning directive. These directives are created as combinations of partitioning strategies and partitioning factors. Both lines 3 and 4 combine two partitioning strategies (cyclic and block) with the associated set of values for the partitioning factors: all the powers of two from 1 up to 512 for knob 3, and all the powers of two from 1 up to 128 for knob 4. Lines 5 and 6 then define, for loop_1 and loop_2, the associated sets of unrolling factors to consider during the exploration: all the powers of two from 1 up to 128 and 16, respectively. Both lines 4 and 5 have a binding decorator (@bind_a), which specifies that the array partitioning directive and the unrolling one must have the same partitioning and unrolling factor for all the configurations described by the CSD. Finally, line 7 defines the target clock.

The configuration space resulting from a DSL descriptor having N different knobs is the Cartesian product of all knob values: $CS = K_1 \times K_2 \times \cdots \times K_N$, where $K_i$ is the set of directive values related to knob $i$, taking into account the restrictions imposed by the bind decorator. In the case of directives requiring multiple parameters, the knob $K_i$ is itself the Cartesian product of the sets of values associated with the knob. Lastly, the total number of configurations, i.e., the configuration space size, is given by its cardinality ($|CS|$).

The configuration space descriptor in Snippet 2 of Figure 2 describes a configuration space with 1600 different configurations. Without the binding decorator, the cardinality of the configuration space would be 12800.

Fig. 3: Adding DSE to DB4HLS with parallel processing. A configuration space descriptor is expanded into configurations 1 to N, which GNU Parallel dispatches as synthesis invocations to M synthesis processes; configurations and results are collected in the database.
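The two cardinalities quoted for the last_step_scan descriptor can be checked with a short sketch. It enumerates the Cartesian product of knobs 3 to 6 as the text describes them and models @bind_a as a filter that keeps only configurations in which the knob 4 partitioning factor equals the knob 5 unrolling factor. The function names are hypothetical, and the sketch assumes that knobs 1, 2 and 7 each carry a single value and therefore do not affect the count. It is not the DSL parser of the framework.

```python
from itertools import product


def powers_of_two(lo, hi):
    """Return lo, 2*lo, 4*lo, ... up to and including hi."""
    values = []
    v = lo
    while v <= hi:
        values.append(v)
        v *= 2
    return values


STRATEGIES = ["cyclic", "block"]

# Knobs with more than one value, following the description of Snippet 2.
# A two-parameter knob is itself a Cartesian product of its value sets.
partition_3 = list(product(STRATEGIES, powers_of_two(1, 512)))  # knob 3
partition_4 = list(product(STRATEGIES, powers_of_two(1, 128)))  # knob 4
unroll_5 = powers_of_two(1, 128)                                # knob 5
unroll_6 = powers_of_two(1, 16)                                 # knob 6


def configuration_space(bind):
    """Enumerate configurations; if bind, knobs 4 and 5 share their factor."""
    configs = []
    for p3, p4, u5, u6 in product(partition_3, partition_4, unroll_5, unroll_6):
        if bind and p4[1] != u5:
            continue  # @bind_a: partitioning factor must equal unroll factor
        configs.append((p3, p4, u5, u6))
    return configs


print(len(configuration_space(bind=False)))  # 20 * 16 * 8 * 5 = 12800
print(len(configuration_space(bind=True)))   # 20 * 16 * 5 = 1600
```

The binding removes one degree of freedom: for each of the 16 values of knob 4 only one of the 8 unrolling factors of knob 5 remains admissible, which divides the unbound count by 8.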
C. A framework for parallelizing HLS runs
Figure 3 gives a high-level view of the infrastructure, realized through Bash and Python scripts, which we provide to automate DSEs and commit their outcomes to DB4HLS. Starting from a user-provided design and Configuration Space Descriptor (CSD), configuration files are automatically generated and stored in the database. Then, using GNU Parallel [13], a tunable number of instances of the employed HLS tool (we use Vivado HLS for the data collection described in Section III) are concurrently and independently executed, one for each configuration. As synthesis runs terminate, the retrieved performance and resource information is also stored in DB4HLS, and new HLS processes are launched until all configurations have been explored.

MySQL statements can then be used to retrieve data from the tables in the database and to access the implementations of a design and the associated performance and resource results.

V. CASE STUDY
Herein, we showcase two possible uses of DB4HLS. We use the database both to compare the results of two DSE methodologies, and as a source of knowledge for one of them. We employed a lattice-based strategy (LB) from [3], and one leveraging prior knowledge (PK) [5], to perform DSEs for the local_scan design space available in DB4HLS. Figure 4 shows the results. Grey dots represent the area and latency of the 704 implementations belonging to the local_scan design space provided by DB4HLS. The figure also reports the approximated Pareto fronts retrieved by the lattice methodology described in [3] (LB) and by the prior-knowledge strategy in [5] (PK).

In this scenario, DB4HLS is employed to comparatively evaluate the two strategies, without having to re-run from scratch a large number of time-consuming syntheses. Moreover, PK requires a set of source design spaces from which to extract prior knowledge. DB4HLS can be effectively employed in this case, or in similar ML-based methods [4], to provide the required knowledge base.

Fig. 4: Example of DSE comparison employing DB4HLS (legend: PK, LB, Exhaustive; axis label: Area).
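A comparison of this kind can be sketched as follows: extract the true Pareto front from an exhaustive set of (area, latency) points, then score an approximated front against it with ADRS. The ADRS formulation below is one commonly used variant and is an assumption here, since the cited works may normalize differently; the function names and the toy data points are made up for illustration and are not DB4HLS results.

```python
def pareto_front(points):
    """Return the non-dominated (area, latency) points, both minimized."""
    front = []
    best_latency = float("inf")
    # After sorting by area, a point is Pareto-optimal only if it strictly
    # improves on the best latency seen among points of smaller-or-equal area.
    for area, latency in sorted(set(points)):
        if latency < best_latency:
            front.append((area, latency))
            best_latency = latency
    return front


def adrs(reference, approximation):
    """Average Distance from Reference Set (assumed common formulation).

    For each reference point, take the smallest normalized worsening in
    area or latency over all approximation points, then average.
    Assumes strictly positive area and latency values.
    """
    total = 0.0
    for ref_area, ref_lat in reference:
        total += min(
            max(0.0, (area - ref_area) / ref_area, (lat - ref_lat) / ref_lat)
            for area, lat in approximation
        )
    return total / len(reference)


# Made-up exhaustive exploration: (area, latency) pairs.
exhaustive = [(10, 100), (20, 50), (30, 50), (40, 20), (25, 80)]
reference = pareto_front(exhaustive)

# Made-up fronts returned by two hypothetical exploration strategies.
front_a = [(10, 100), (20, 50), (40, 20)]  # recovers the whole reference set
front_b = [(10, 100), (40, 20)]            # misses the middle point

print(reference)
print(adrs(reference, front_a), adrs(reference, front_b))
```

A lower ADRS means the approximated front lies closer to the exhaustive one; a strategy that recovers every reference point scores zero.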
VI. CONCLUSIONS
DB4HLS offers an extensive set of DSEs targeting functions from MachSuite [6]. The data collection is made publicly available and will be updated, increasing the number of design explorations and targeted benchmarks. In addition, further design spaces can be effectively defined through a novel domain-specific language and a framework to efficiently contribute novel explorations to DB4HLS. Both the DB4HLS database and the framework for DSE generation are publicly available at .

REFERENCES

[1] H.-Y. Liu and L. P. Carloni, "On learning-based methods for design-space exploration with high-level synthesis," in Proceedings of the 50th Design Automation Conference, Jun. 2013, pp. 1–6.
[2] L. Ferretti, G. Ansaloni, and L. Pozzi, "Cluster-based heuristic for high level synthesis design space exploration," IEEE Transactions on Emerging Topics in Computing, no. 99, pp. 1–9, Jan. 2018.
[3] L. Ferretti, G. Ansaloni, and L. Pozzi, "Lattice-traversing design space exploration for high level synthesis," in Proceedings of the International Conference on Computer Design, Oct. 2018, pp. 210–217.
[4] Z. Wang, J. Chen, and B. C. Schafer, "Efficient and robust high-level synthesis design space exploration through offline micro-kernels pre-characterization," in . IEEE, 2020, pp. 145–150.
[5] L. Ferretti, J. Kwon, G. Ansaloni, G. Di Guglielmo, L. Carloni, and L. Pozzi, "Leveraging prior knowledge for effective design-space exploration in high-level synthesis," in CASES. IEEE, 2020, pp. 145–150.
[6] B. Reagen, R. Adolf, Y. S. Shao, G.-Y. Wei, and D. Brooks, "Machsuite: Benchmarks for accelerator design and customized architectures," in Proceedings of the IEEE International Symposium on Workload Characterization, Oct. 2014, pp. 110–119.
[7] S. Xydis, G. Palermo, V. Zaccaria, and C. Silvano, "SPIRIT: Spectral-Aware Pareto Iterative Refinement Optimization for Supervised High-Level Synthesis," IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 34, no. 1, pp. 155–159, Oct. 2015.
[8] G. Zhong, V. Venkataramani, Y. Liang, T. Mitra, and S. Niar, "Design Space Exploration of Multiple Loops on FPGAs Using High Level Synthesis," in Proceedings of the International Conference on Computer Design, Dec. 2014, pp. 456–463.
[9] Y. Hara, H. Tomiyama, S. Honda, H. Takada, and K. Ishii, "Chstone: A benchmark program suite for practical c-based high-level synthesis," in . IEEE, 2008, pp. 1192–1195.
[10] Y. Zhou, U. Gupta, S. Dai, R. Zhao, N. Srivastava, H. Jin, J. Featherston, Y.-H. Lai, G. Liu, G. A. Velasquez et al., "Rosetta: A realistic high-level synthesis benchmark suite for software programmable fpgas," in Proceedings of the 2018 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, 2018, pp. 269–278.
[11] B. C. Schafer and A. Mahapatra, "S2cbench: Synthesizable systemc benchmark suite for high-level synthesis,"