IronMan: GNN-assisted Design Space Exploration in High-Level Synthesis via Reinforcement Learning
Nan Wu, University of California, Santa Barbara, Santa Barbara, CA, USA
Yuan Xie, University of California, Santa Barbara, Santa Barbara, CA, USA
Cong Hao, Georgia Institute of Technology, Atlanta, GA, USA
ABSTRACT
Despite the great success of High-Level Synthesis (HLS) tools, we observe several unresolved challenges: 1) the high-level abstraction of programming styles in HLS sometimes conceals optimization opportunities; 2) existing HLS tools do not provide flexible trade-off (Pareto) solutions among different objectives and constraints; 3) the actual quality of the resulting RTL designs is hard to predict. To address these challenges, we propose an end-to-end framework named IronMan. Its primary goal is to enable flexible and automated design space exploration (DSE) that provides either optimal solutions under user-specified constraints, or various trade-offs among different objectives (such as different types of resources, area, and latency). Such design space exploration through existing HLS tools either requires tedious manual effort or cannot attain these goals at all. IronMan consists of three components: 1) GPP, a highly accurate graph-neural-network-based performance and resource predictor; 2) RLMD, a reinforcement-learning-based multi-objective DSE engine that explores the optimal resource allocation strategy to provide Pareto solutions between different objectives; 3) CT, a code transformer that assists RLMD and GPP by extracting the data flow graph from the original HLS C/C++ and automatically generating synthesizable code with HLS directives. Experimental results show that: 1) GPP achieves high prediction accuracy, reducing the prediction errors of HLS tools by more than 10x in resource utilization and more than 5x in timing; 2) RLMD obtains optimal or Pareto solutions that outperform the genetic algorithm and simulated annealing by 12.7% and 12.9%; 3) IronMan is able to find optimized solutions perfectly matching various DSP constraints, with 2.54x fewer DSPs and up to 6x shorter latency than those of HLS tools, while being up to 400x faster than the heuristic algorithms and HLS tools.

INTRODUCTION
High-Level Synthesis (HLS) benefits ASIC and FPGA design automation by enabling automated transformation from behavioral descriptions in high-level languages (C/C++, etc.) to RTL designs. Besides the widely used commercial HLS tools for FPGAs [17] and ASICs [1], a great amount of recent effort focuses on improving RTL design quality, including loop transformation and memory allocation [2, 21], performance and/or resource prediction [8, 18, 19], and design space exploration (DSE) [12]. Even though these prior efforts show great effectiveness, several principal challenges remain unaddressed. 1) The high-level abstraction of programming styles in HLS, such as loops and function calls, sometimes performs poorly in hardware implementation and conceals further optimization opportunities [7]. To overcome this problem, Licht et al. [7] propose a toolbox, from which
developers can choose multiple classes of transformations with different objectives, such as increasing parallelism. Even provided with such transformation tools, manual exploration effort is still needed. 2) Existing DSE approaches, as well as commercial HLS tools, do not provide flexible trade-off (Pareto) solutions between different objectives and constraints (e.g., latency and resource usage, or different types of resources). As pointed out by Schafer et al. [12], existing approaches usually sacrifice design latency for less resource usage, or vice versa. One alternative, however, is to trade one type of resource for another (e.g., LUTs and DSPs on FPGAs) while maintaining the latency, which is unexplored and can only be done through tedious manual effort (detailed examples are provided in Section 2). 3) The real quality of the resulting RTL designs is hard to predict, especially for irregular data paths. Most existing predictors are model-based (e.g., COMBA [18]) and thus better suited to well-structured data flows such as perfect or semi-perfect loops. Another approach applies machine learning for prediction [3, 8, 19], which often requires extracting abundant features after design synthesis and/or implementation.

Figure 1: The overall framework, IronMan, integrates CT, GPP, and RLMD. GPP is a highly accurate graph-neural-network-based performance and resource predictor; RLMD is a reinforcement-learning-based multi-objective DSE engine for resource allocation; CT is a code transformer that extracts the data flow graph from the original HLS C/C++ and generates synthesizable code with new HLS directives.

In this work, we propose an end-to-end framework, namely
IronMan, to address the aforementioned challenges. The framework has two major goals: 1) to enable flexible and automated DSE that explores various trade-offs between different objectives, such as resource types and latency, which is not achievable through existing HLS tools or DSE engines; 2) to provide an accurate RTL design performance predictor that requires no additional features beyond the original data flow graph (DFG) and supports both regular and irregular data paths. Fig. 1 illustrates the overall flow of IronMan, composed of three components that seamlessly cooperate with each other. We briefly introduce the components and summarize our contributions as follows.
• GPP: we propose a highly accurate Graph-neural-network (GNN)-based performance predictor for HLS designs (both regular and irregular data paths), covering resource utilization (DSPs and LUTs) and critical path (CP) timing. Notably, it predicts the actual performance after physical synthesis (placement and routing) rather than the synthesized results of HLS tools. Moreover, GPP generalizes to previously unseen DFGs.
• RLMD: we propose a deep-reinforcement-learning-based multi-objective design space exploration engine for resource allocation in HLS. Assisted by GPP, RLMD explores the optimal resource allocation strategy based on user-specified constraints. The objectives include minimizing resource utilization, optimizing critical path timing, and/or minimizing DFG computation latency.
RLMD also provides Pareto solutions between different objectives that are unavailable in HLS tools.
• CT: we propose a code transformer that extracts DFGs from the original HLS C/C++ and re-generates synthesizable code with HLS directives after RLMD performs DSE to obtain the optimal directives. CT reveals concealed optimization opportunities for achieving higher parallelism and shorter latency, and allows flexible and finer-grained DSE under user-specified constraints.
• IronMan: while each proposed component alone can independently contribute to the HLS community (performance prediction, DSE, code transformation), we integrate them into one framework, IronMan, and demonstrate the end-to-end benefits on benchmarks from real-world applications.
• The experimental results show that: 1)
GPP achieves high prediction accuracy on actual resource usage, reducing the prediction errors of HLS tools by more than 10x in resource utilization and more than 5x in critical path timing; 2) RLMD obtains optimal or Pareto solutions that outperform the genetic algorithm and simulated annealing by 12.7% and 12.9%; 3) IronMan is able to find solutions satisfying various DSP constraints, with 2.54x fewer DSPs and up to 6x shorter latency than those of HLS tools, while being up to 400x faster than the heuristic algorithms and HLS tools.

MOTIVATION
HLS tools have been extensively studied for years, with great achievements in alleviating hardware design burdens. Many important investigations focus on the optimization of HLS tools, performance prediction, and DSE technologies [3, 7, 8, 12, 18-20]. Regardless of these achievements, however, we have the following observations that reveal the insufficiency of existing HLS tools.
1. Higher-level abstractions in HLS can obstruct optimization opportunities. (1) Irregular logic and cascaded and/or imperfect loops in HLS code usually require manual or complicated

[Figure 2 panels: DSP-LUT and DSP-CP Pareto solutions vs. the Vivado HLS default solution.]
Figure 2: Pareto solutions between DSPs and LUTs on an FPGA. The default HLS solution is not on the Pareto frontier, and it is non-trivial to obtain the Pareto solutions in a large design space.
Orig. code: for (int i = 0; i < 8; i++) sum += a[i] * b[i];

Line  Method                       Cycles  DSPs  LUTs  CP (ns)
1-4   unroll / manual loop tiling    ...    ...   ...    ...
5-8   allocation pragma              ...    ...   ...    ...
9     CT + resource (5 Mul_LUT)       2      2    1742   4.24
10    CT + resource (4 Mul_LUT)       2      2    1741   4.01
11    CT + resource (3 Mul_LUT)       2      3    1461   3.98
* HLS pragmas do not always behave as expected.
Table 1: Approaches to meeting a DSP constraint (e.g., <= 3) for the above for-loop, with different clock cycles (latency), numbers of LUTs, and critical path (CP) timing. This work explores the CT + resource approaches.

code transformations to improve performance [7]. (2) The structured HLS coding style hinders advanced or fine-grained performance and resource optimization. Table 1 demonstrates a simple multiply-accumulate function using a for-loop. To explore the trade-offs between DSP usage and the number of clock cycles (latency), the typical ways are to use unroll pragmas or manual loop tiling, as in table lines 1-4. However, when the loop bound (e.g., 8) is not divisible by the DSP constraint (e.g., 3), a partial unrolling results, as in line 4, introducing an undesired latency increase (from 4 to 8) and worsening the critical path (CP) timing (from 5 ns to 7.4 ns). Nested loops further complicate this problem and make it much harder to balance latency against resources (imagine a 5-level nested loop with a DSP constraint of 17). Motivated by the necessity of better performance and more flexible optimization choices, we propose a code transformer (CT), which breaks up the high-level abstractions and their boundaries, fuses them into a new data flow graph (DFG), and re-generates synthesizable C/C++ code with pragmas. An example of CT is given in Fig. 3(a). CT readily allows directives such as allocation and resource pragmas to be used for finer-grained DSE over resource and performance, as shown in Table 1, lines 7-11. Notably, using the
CT+resource approaches proposed in this work (lines 9-11) achieves the best latency (i.e., 2 cycles) within the DSP constraint (i.e., <= 3) without manual effort. An alternative way to constrain DSPs is to use the allocation pragma, but at the cost of increased latency (lines 5-8).
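What CT does for this example can be illustrated with a toy sketch (purely illustrative: the function name and per-multiplier LUT handling are hypothetical, not IronMan's actual implementation). It fully unrolls the multiply-accumulate loop into single-assignment statements with a balanced adder tree, and attaches resource pragmas to the multiplications beyond a DSP budget:

```python
# Toy sketch (not the actual CT tool): fully unroll an n-term
# multiply-accumulate loop into single-assignment statements with a
# balanced adder tree, and attach Mul_LUT resource pragmas to the
# multiplications beyond a DSP budget.

def transform_mac(n_terms, dsp_budget):
    """Emit unrolled C-like statements for sum += a[i]*b[i]."""
    stmts = [f"m{i} = a[{i}] * b[{i}];" for i in range(n_terms)]
    # Multiplications beyond the DSP budget are mapped to LUTs.
    pragmas = [f"#pragma HLS resource variable=m{i} core=Mul_LUT"
               for i in range(dsp_budget, n_terms)]
    # Balanced adder tree: depth ceil(log2(n_terms)) addition levels.
    level, tmp = 0, [f"m{i}" for i in range(n_terms)]
    while len(tmp) > 1:
        nxt = []
        for j in range(0, len(tmp) - 1, 2):
            name = f"s{level}_{j//2}"
            stmts.append(f"{name} = {tmp[j]} + {tmp[j+1]};")
            nxt.append(name)
        if len(tmp) % 2:
            nxt.append(tmp[-1])
        tmp, level = nxt, level + 1
    stmts.append(f"sum = {tmp[0]};")
    return pragmas + stmts, level  # level = adder-tree depth

code, depth = transform_mac(8, 3)
print(depth)  # a balanced tree of 8 products needs 3 addition levels
```

With the tree flattened this way, per-multiplier resource pragmas become applicable, which is exactly the finer-grained knob the loop-level pragmas lack.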
2. HLS tools do not always provide the best solution, nor do they automatically provide trade-offs (Pareto solutions).
Given a DFG, either original or generated by CT, an extensive DSE is required to explore the fine-grained trade-offs among performance, resource usage, and resource type (e.g., DSPs vs. LUTs). Fig. 2 exemplifies the Pareto solutions between LUTs and DSPs, achieved by specifying that certain multiplications use LUTs instead of DSPs. The red stars are the Pareto solutions and the magenta dot is the single HLS default solution. The input DFG has 200 operations and is synthesized by Vivado HLS and implemented by Vivado. In this example, first, the HLS default solution is not on the Pareto frontier; second, the design space for finding the Pareto solutions is large, and thus the
DSE for Pareto solutions is non-trivial. Worth noting, the solution space grows exponentially even for a binary selection between DSP and LUT for each multiplication, and different data precisions (bit-widths) further complicate the problem. Motivated by the necessity and difficulty of flexible and fine-grained DSE, we propose a deep-reinforcement-learning-based DSE tool,
RLMD.
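For illustration, extracting a Pareto frontier from candidate (DSPs, LUTs) design points, as plotted in Fig. 2, can be sketched with a generic helper (not part of IronMan; the candidate values below are made up):

```python
# Generic Pareto-frontier filter over (DSPs, LUTs) design points,
# minimizing both objectives; illustrative only, not part of IronMan.

def pareto_frontier(points):
    """Return the points not dominated by any other (lower is better)."""
    frontier = []
    for p in points:
        dominated = any(q[0] <= p[0] and q[1] <= p[1] and q != p
                        for q in points)
        if not dominated:
            frontier.append(p)
    return sorted(frontier)

candidates = [(10, 900), (8, 1200), (12, 800), (8, 1100), (14, 1000)]
print(pareto_frontier(candidates))  # -> [(8, 1100), (10, 900), (12, 800)]
```

The hard part is not the filtering but producing good candidates: each point requires a full synthesis-and-implementation run, which is what motivates GPP's fast predictions.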
3. Existing HLS performance predictors and DSE tools do not support irregular logic and data paths.
Instead, they mostly target highly regular data paths, e.g., perfect and nested loops with high-level directives such as pipelining and array partitioning [18, 20]. As such, these analytical-model-based predictors and high-level DSE tools are not suitable for irregular logic and data paths (especially for timing estimation). Fortunately, the inherent graph structure of DFGs provides a promising opportunity to exploit the representational power of GNNs [4, 6, 11]. Motivated by the necessity of DSE and high-accuracy prediction for irregular data paths, and by the intrinsic graph structure of DFGs, we propose a GNN-based HLS performance predictor,
GPP, enabling RLMD to perform DSE on arbitrary DFGs.

THE IRONMAN FRAMEWORK
The overall framework of IronMan is shown in Fig. 1. The inputs to IronMan are HLS C/C++ code and user-specified constraints. The outputs are re-generated, multi-objective-optimized code with proper HLS directives, either meeting the user-specified constraints (e.g., resource or latency) or providing Pareto solutions between different optimization objectives.

CT extracts the intrinsic DFGs from the intermediate representation (IR) of HLS tools to release more optimization opportunities, and then re-generates synthesizable C/C++ code with optimized DFGs. An example of how CT re-generates C++ code is shown in Fig. 3(a), with the extracted DFG in (b). Each intermediate operator may have a different bit-width; e.g., <12> means a 12-bit data precision.

GPP, a GNN-based performance predictor, estimates actual resource usage after physical synthesis of the DFGs. GNNs [4, 6, 11] are adopted for three reasons. (1) DFGs are graphs, which are naturally suitable for GNNs to learn the underlying information from graph structures. (2) DFGs vary in topology and size, and in order to generalize predictions to new or unseen graphs,
[Figure 3 content: (a) original code: for (int i = 0; i < 4; i++) sum += a[i] * b[i]; transformed code: TYPE1 m1 = i1 * i2; TYPE2 m2 = i3 * i4; TYPE3 m3 = i5 * i6; TYPE4 m4 = i7 * i8; TYPE5 m5 = m1 + m2; TYPE6 m6 = m3 + m4; TYPE7 m7 = m5 + m6; with the pragma HLS resource variable=m3 core=Mul_LUT. The TYPEs can be varied data precisions (bit-widths), which further expands the design space (e.g., from INT2 to INT32). (b)-(d) show the extracted DFG (inputs i1<12>, i2<10>, i3<16>, i4<11>, i5<12>, i6<8>, i7<16>, i8<9>; intermediates m1<12>, m2<16>, m3<8>, m4<12>, m5<12>, m6<10>; output o1<16>) with operations mapped to DSPs or LUTs under each solution, plus examples of 10-dimension node feature vectors, e.g., [0, 0, 1, 0, 0, 1, 0, 1, 1, 1] and [0, 1, 0, 0, 0, 1, 0, 0, 1, 0].]
Figure 3: An example IronMan solution. (a) The original HLS code and the transformed code with a resource pragma, indicating the importance of CT for IronMan solutions. (b) The HLS default solution, with 4 DSPs and a latency of 3. (c) The HLS solution with naive constraints, with 2 DSPs but a latency increased from 3 to 4. (d) The IronMan solution, with 2 DSPs and an unchanged latency of 3.

it is necessary to use inductive
GNNs [4] to learn fixed-size graph embeddings. (3) IronMan runs inference of trained GNN models during execution, which is orders of magnitude faster than running HLS tools.

RLMD, an RL-based DSE engine, takes the DFG, its corresponding graph embedding, and user-specified constraints as inputs, and searches for the optimal resource allocation strategy. RL is adopted for two main reasons. (1) The design space grows exponentially with the size of the DFG, its topology, and the various data precisions, whereas an RL agent can explore the design space proactively and learn from past experience, and a well pre-trained agent can generalize to new problems with minimal fine-tuning effort. (2) With carefully defined reward functions, RL agents can achieve multi-objective optimization automatically, getting rid of manual effort to craft useful heuristics. Noteworthy, an informative and well-crafted state representation significantly benefits the learning process in RL problems, motivating the integration of GPP and RLMD: the graph embeddings are naturally suitable as state representations in this problem. Consequently, the graph embeddings enable RLMD to generalize across different DFG topologies, and GPP greatly accelerates the training of RLMD by quickly evaluating the solutions RLMD generates.

As a case study of IronMan, the specific problem solved is to find a resource allocation solution that strictly meets the DSP constraint, or to find Pareto solutions between DSPs and LUTs on FPGAs, without sacrificing DFG computation latency. For simplicity, the DFGs only have additions and multiplications, where RLMD decides whether to assign the resource directive (core=Mul_LUT) to each multiplication operation, to minimize LUTs within DSP constraints. As shown in Fig.
3, (b) is the default Vivado HLS solution, with 4 DSPs and a latency of 3; (c) is a naive solution using the allocation pragma (HLS allocation instance=mul limit=2 operation) to enforce two DSPs, resulting in a latency increased from 3 to 4; (d) shows the solution of IronMan, with 2 DSPs and an unchanged latency of 3, using HLS resource variable=<var> core=Mul_LUT pragmas. Notably, such finer-grained DSE in IronMan is enabled by CT.

GPP AND RLMD
Since GPP provides the inputs and performance predictions for RLMD, we first introduce the structure of GPP, and then discuss the detailed formulation of RLMD.
The key role of the GNN is to extract adequate information about node types, graph topology, and connectivity within a large DFG, and to encode this information into low-dimensional vector representations that can be used either for downstream tasks or for high-accuracy performance prediction.
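As a concrete but purely illustrative sketch of this idea, the following NumPy snippet encodes a small DFG with a two-layer, Kipf-style graph convolution and mean pooling. The weights are random rather than trained, and the 64/128 layer sizes merely mirror the structure described in the following paragraphs; none of this is GPP's actual code.

```python
import numpy as np

# Minimal sketch of GPP's encoder idea: message passing over the DFG
# adjacency, then mean pooling to a fixed-size graph representation.
# Weights are random here; the real GPP is trained via regression.
rng = np.random.default_rng(0)

def gcn_layer(A_hat, H, W):
    """One graph-convolution layer: aggregate neighbors, then ReLU."""
    return np.maximum(A_hat @ H @ W, 0.0)

def encode_dfg(A, X, sizes=(64, 128)):
    """Encode a DFG (adjacency A, node features X) into one vector."""
    # Symmetrically normalized adjacency with self-loops (Kipf-style).
    A_hat = A + np.eye(A.shape[0])
    d = A_hat.sum(axis=1)
    A_hat = A_hat / np.sqrt(np.outer(d, d))
    H = X
    for size in sizes:
        W = rng.standard_normal((H.shape[1], size)) * 0.1
        H = gcn_layer(A_hat, H, W)
    # Mean pooling yields a 128-dim graph representation regardless of
    # the number of nodes (fully connected layers would follow).
    return H.mean(axis=0)

# A tiny 4-node DFG with 10-dimension node features.
A = np.array([[0, 0, 1, 0], [0, 0, 1, 0], [0, 0, 0, 1], [0, 0, 0, 0]], float)
A = A + A.T  # treat edges as undirected for aggregation
X = rng.standard_normal((4, 10))
emb = encode_dfg(A, X)
print(emb.shape)  # (128,), independent of graph size
```

The key property on display is size-invariance: pooling produces a fixed-size vector for any DFG, which is what lets one trained model serve unseen graphs.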
Node Feature Vector.
In a DFG, each node is encoded into a 10-dimension node feature vector, as in the example shown in Fig. 3(d). The 1st to 4th dimensions use a one-hot representation to encode the node type, including input nodes, intermediate nodes/operations (additions and multiplications), and output nodes. The 5th to 9th dimensions encode the data precision of an intermediate operation, which in this work ranges from INT2 to INT32. We use a binary representation to encode the precision minus one, so the bit-width can be expressed in 5 bits. The 10th dimension indicates whether an HLS directive is applied to this node. Note that such an encoding scheme can easily be extended to support more types of nodes/operations or pragmas. Graph Embedding.
We employ three GNN models to separately predict the number of utilized LUTs, the number of DSPs, and CP timing. The three models have the same structure, as illustrated in the left part of Fig. 4. For each GNN model, the inputs are the adjacency matrices and node feature matrices of DFGs. The first two layers are graph convolutional [6], with 64 and 128 units, respectively, and ReLU activations. In each graph convolutional layer, a node's embedding is updated by aggregating the feature vectors of its neighbors, and a node can receive information from farther nodes when multiple layers are stacked. Next, the learned node embeddings are summarized by mean pooling to create a graph representation, denoted by a 128x1 vector, which then passes through fully connected layers with LeakyReLU (alpha = 0.1) activations to generate a graph embedding, denoted by a 64x1 vector. Integration with RLMD.
To integrate with RLMD, we combine the three embedding vectors, each focusing on different characteristics of the DFG, into one graph embedding, shown as a 192x1 vector. RL Formulation.
The resource allocation problem in HLS, as a typical RL [13] problem, can be formulated as a Markov Decision Process (MDP) with four key components.
• States: the set of possible states. In this problem, a state is any possible partially assigned DFG.
• Actions: the set of eligible actions in a state. In this problem, given the current state and the currently considered node of the DFG, the available action is whether to assign a certain directive to this node.
• State transition: given a state and an action, the probability distribution over next states.
• Reward: the immediate reward of taking an action in a state. In this problem, the reward is 0 for all intermediate actions, with the exception of the last action, whose reward is the evaluation of the fully assigned DFG subject to user-specified constraints.
Specifically, the state at time step t is defined as s_t, a concatenation of features including the 192x1 graph embedding; the action a_t is a valid assignment of a directive to the t-th node, i.e., whether to use LUTs for the multiplication computation on this node. We define the reward r_t as a negative weighted sum of predicted LUTs, CP timing, and the difference between predicted and targeted DSPs, as follows:

r_t = { -a*LUT_p - b*|DSP_target - DSP_p| - l*CP_p,  t = T;  0,  0 < t < T }   (1)

where a, b, and l are hyper-parameters. At the initial state s_0, all multiplication nodes in the DFG are unassigned. At each time step t, the RL agent observes the current state s_t, takes an action a_t, receives a reward r_{t+1}, and arrives at a new state s_{t+1}. The nodes are assigned directives sequentially based on their node IDs. Given T multiplication nodes in total, the
Concatenating the graph embedding provided by GPP with the meta data of the input DFG, RLMD then outputsa binary probability distribution 𝜋 ( 𝑎 𝑡 | 𝑠 𝑡 ) of whether to use LUTs for multiplication computation on the current node, and ascalar as the state-value function. final state 𝑠 𝑇 corresponds to a DFG completely assigned with properdirectives. The goal is to maximize the expected rewards received. RLMD Training.
We adopt the actor-critic method with Monte-Carlo learning [13]: the actor aims to learn an optimal policy pi_theta(a_t | s_t) parameterized by theta, which is a probability distribution over valid actions in the current state; the critic approximates the state-value function V(s_t) = E_pi[sum_{k=0}^{T-t} gamma^k * r_{t+k} | s_t] with parameters w, an estimate of the total reward from state s_t to s_T following policy pi, measuring the goodness of this state. Here gamma in (0, 1] is the discount factor. As shown in the right part of Fig. 4, the actor pi_theta and the critic V_w share parameters; for clarity we denote theta and w separately. With Monte-Carlo learning, the parameters are updated only after one complete episode (i.e., one complete assignment process of a DFG), leading to the following updates:

delta_i = gamma^(T-i) * r_T - V_w(s_i),   (2)
Delta_w ∝ sum_{i=1}^{T} delta_i * grad_w V_w(s_i),   (3)
Delta_theta ∝ sum_{i=1}^{T} delta_i * grad_theta log pi_theta(a_i | s_i),   (4)

where T is the total number of time steps in one episode. Through repeated episodes (i.e., sequences of states, actions, and rewards), the actor learns an optimized policy that maximizes cumulative reward. Our ultimate goal is to enable RLMD to generate higher-quality results and transfer knowledge across various DFGs as it gains experience from producing resource allocation strategies on more and more DFGs. Thus, we formally formulate the overall optimization objective as:

J(theta, w, G) = (1/K) * sum_{g in G} E_{g,l ~ pi_theta}[R_{g,l}],   (5)

where J(theta, w, G) measures the expected reward over all training DFGs. The DFG dataset G has K different DFGs, each denoted g. R_{g,l} is the episode reward (i.e., r_T defined in Eq. (1)) of resource allocation solution l on DFG g. For better exploration during training, we apply the epsilon-greedy algorithm for action selection [13]. RLMD Fine-tuning.
Given an unseen DFG, the simplest approach is to directly apply the pre-trained RLMD for inference, which can generate a solution within a second. When higher-quality solutions are expected, the pre-trained RLMD can be further fine-tuned on the particular DFG with the help of GPP. The fine-tuning step provides the flexibility to trade off between a quick solution using the pre-trained RLMD (which has learned rich knowledge of resource allocation strategies on other DFGs) and a longer yet better one for a particular DFG.
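The Monte-Carlo actor-critic updates of Eqs. (2)-(4) can be sketched on a stripped-down version of the problem (illustrative only: a hand-made 3-feature state stands in for the graph embedding, and the terminal reward keeps only the DSP-target term of Eq. (1); hyper-parameter values here are arbitrary):

```python
import numpy as np

# Toy actor-critic with the Monte-Carlo updates of Eqs. (2)-(4), on a
# stripped-down task: pick LUT-vs-DSP for T multiplier nodes, with the
# terminal reward keeping only the DSP-target term of Eq. (1).
rng = np.random.default_rng(1)
T, dsp_target, beta, gamma, lr = 10, 4, 1.0, 1.0, 0.05

def features(t, dsp_used):
    # Hand-made state: bias, progress, and DSPs committed so far
    # (the real RLMD uses the 192-dim graph embedding plus metadata).
    return np.array([1.0, t / T, dsp_used / T])

theta = np.zeros(3)  # actor: P(keep DSP) = sigmoid(theta . s_t)
w = np.zeros(3)      # critic: V_w(s_t) = w . s_t

for episode in range(5000):
    states, actions, dsp_used = [], [], 0
    for t in range(T):
        s = features(t, dsp_used)
        p = 1.0 / (1.0 + np.exp(-theta @ s))
        a = int(rng.random() < p)        # 1 = keep DSP, 0 = use LUTs
        states.append(s); actions.append(a)
        dsp_used += a
    r_T = -beta * abs(dsp_target - dsp_used)   # terminal reward only
    for i, (s, a) in enumerate(zip(states, actions)):
        delta = gamma ** (T - i) * r_T - w @ s           # Eq. (2)
        w += lr * delta * s                              # Eq. (3)
        p = 1.0 / (1.0 + np.exp(-theta @ s))
        theta += lr * delta * (a - p) * s                # Eq. (4)

# The learned policy concentrates its DSP usage near the target.
counts = []
for _ in range(200):
    used = 0
    for t in range(T):
        p = 1.0 / (1.0 + np.exp(-theta @ features(t, used)))
        used += int(rng.random() < p)
    counts.append(used)
print(np.mean(counts))
```

The (a - p) * s term is the score-function gradient of a Bernoulli policy, so the theta update is exactly the form of Eq. (4) with a learned baseline from Eq. (3).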
EVALUATION
Dataset Generation.
In order to train GPP and RLMD, we build a dataset containing both synthetic and real-case DFGs, generated by our CT. We randomly create 47 different synthetic DFGs, each of which has 100 to 200 operations, each either a multiplication or an addition. Each operation has two operands with random data precision between INT2 and INT16. For each DFG, we further generate up to 100 sets of directives, each specifying a subset of multiplications to be implemented by LUTs rather than DSPs. For real-case DFGs, we consider 8 benchmarks from MachSuite [10], CHStone [5], and PolyBench/C [9]: gemm, kernel_2mm, kernel_durbin (small, large), spmv, stencil3d (small, large), and kernel_adi. The ground-truth (actual) resource usage (LUT/DSP) and critical path timing are obtained by synthesizing with Vivado HLS [17] and implementing with Vivado [15], targeting the Xilinx Ultra96 part xc7z020clg484.
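The synthetic side of this dataset could be sketched as follows (illustrative only; the field names and generator are hypothetical, not the actual dataset format):

```python
import random

# Sketch of synthetic-DFG generation in the spirit described above:
# random add/mul DAGs with random INT2-INT16 precisions, plus random
# directive sets marking a subset of multiplications as Mul_LUT.
# The dict schema is illustrative, not the actual dataset format.

def make_dfg(n_ops, seed):
    rng = random.Random(seed)
    nodes = []
    for i in range(n_ops):
        nodes.append({
            "id": i,
            "op": rng.choice(["add", "mul"]),
            "bitwidth": rng.randint(2, 16),  # INT2 .. INT16
            # Operands come from earlier nodes, keeping the graph acyclic;
            # the first two nodes act as primary inputs.
            "preds": [rng.randrange(i) for _ in range(2)] if i >= 2 else [],
        })
    return nodes

def make_directive_sets(dfg, n_sets, seed):
    rng = random.Random(seed)
    muls = [n["id"] for n in dfg if n["op"] == "mul"]
    return [sorted(rng.sample(muls, rng.randint(0, len(muls))))
            for _ in range(n_sets)]

dfg = make_dfg(n_ops=150, seed=0)
directive_sets = make_directive_sets(dfg, n_sets=100, seed=1)
print(len(dfg), len(directive_sets))  # 150 100
```

Each (DFG, directive set) pair then corresponds to one synthesized design, i.e., one labeled training sample for GPP.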
Training Process.
To demonstrate the generalization capability of IronMan across different DFGs and applications, GPP and RLMD are trained on part of the DFGs from the dataset and evaluated on the rest

[Figure 5 panel data. LUT MAPE: HLS_syn = 122.4%, HLS_real = 92.2%, GPP_syn = 9.2%, GPP_real = 7.4%. DSP RMSE: HLS_syn = 26.9, HLS_real = 19.7, GPP_syn = 2.1, GPP_real = 5.6. CP MAPE: HLS_syn = 7.7%, HLS_real = 42.1%, GPP_syn = 4.2%, GPP_real = 4.6%.]
Figure 5: GPP predictions on resource utilization (LUTs and DSPs) and critical path timing (CP timing).

[Figure 6 panel data, LUT utilization reduction. Case 1: RLMD over SA -11.8%, RLMD over GA -12.5%, RLMD-FT over RLMD -12.0%. Case 2: RLMD over SA -11.6%, RLMD over GA -8.7%, RLMD-FT over RLMD -11.5%. Case 3: RLMD over SA -12.3%, RLMD over GA -14.1%, RLMD-FT over RLMD -11.8%. Case 4: RLMD over SA -15.1%, RLMD over GA -18.7%, RLMD-FT over RLMD -12.7%. Case 5: RLMD over SA -13.2%, RLMD over GA -11.8%, RLMD-FT over RLMD -11.1%. Case 6: RLMD over SA -12.1%, RLMD over GA -11.6%, RLMD-FT over RLMD -10.4%.]
Figure 6: Pareto solutions found by RLMD, SA, and GA on synthetic DFGs, with unchanged latency.

of them: 41 different synthetic DFGs and 4 real-case DFGs (kernel_durbin and stencil3d) compose the training set. GPP is trained via regression, minimizing the mean squared logarithmic error for DSPs and CP timing, and the mean absolute error for LUTs. The supervised learning process and the GNN models enable GPP to identify the features and information necessary to generalize performance prediction and graph embeddings across different DFGs. In terms of hyper-parameter selection, GPP is trained over 200 epochs with a batch size of 32. The Adam optimizer is applied with an initial learning rate of 0.01, decaying exponentially. Once GPP is trained, it is integrated with RLMD, and the training of RLMD can start. To train RLMD, we provide tuples of the form [DFG_index, DSP_target] to the RL agent, and the optimization goal is to maximize the average cumulative reward over all tuples, so that the agent learns resource allocation strategies under different DSP constraints and across different DFGs. There are 1125 different tuples in total, and each tuple appears 8 times during training, amounting to 9,000 episodes. To fine-tune RLMD on a particular DFG, we additionally conduct 500 episodes of training with various targeted DSPs. The parameters of RLMD are also learned by the Adam optimizer, with a learning rate of 0.008. We empirically set the discount rate gamma, and the exploration rate epsilon = 0.
08, decaying exponentially. In the reward function, the hyper-parameters alpha, beta, and lambda are set empirically, with beta = 5. Baselines.
To evaluate GPP, we compare against the HLS tool Vivado HLS [17] and the machine-learning-based circuit performance predictor Pyramid [8]. To evaluate IronMan, we compare against three optimization methods: 1) the genetic algorithm [16] (denoted GA), a prominent instance of evolutionary optimization algorithms; 2) simulated annealing [14] (denoted SA), an effective technique for approximating the global optimum in an extremely large search space; 3) Vivado HLS, i.e., the default solutions provided by the Vivado HLS tool [17]. Solutions from the different optimization methods are finally synthesized and implemented by Vivado to obtain actual performance. Worth noting, none of the state-of-the-art DSE methods for HLS [12] can perform fine-grained resource allocation as proposed in this work, because they lack the proposed code transformation.
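For reference, the SA baseline's search over binary LUT/DSP assignments can be sketched as follows (illustrative only: the cost function is a hypothetical stand-in for the GPP/Vivado evaluation, and the cooling schedule and constants are not those used in our experiments):

```python
import math, random

# Sketch of an SA baseline: anneal over binary LUT/DSP assignments for
# the multiplications of a DFG. `cost` is a stand-in for the real
# evaluation (LUT proxy plus a penalty for missing the DSP target).

def simulated_annealing(n_muls, dsp_target, iters=5000, seed=0):
    rng = random.Random(seed)
    def cost(assign):                      # assign[i] = 1 -> use LUTs
        dsp = n_muls - sum(assign)
        lut_proxy = sum(assign) * 300      # rough LUTs per Mul_LUT
        return lut_proxy + 10000 * abs(dsp - dsp_target)
    cur = [rng.randint(0, 1) for _ in range(n_muls)]
    cur_cost, temp = cost(cur), 1000.0
    for _ in range(iters):
        cand = cur[:]
        cand[rng.randrange(n_muls)] ^= 1   # flip one assignment
        delta = cost(cand) - cur_cost
        if delta <= 0 or rng.random() < math.exp(-delta / temp):
            cur, cur_cost = cand, cur_cost + delta
        temp *= 0.999                      # geometric cooling
    return cur, cur_cost

assign, c = simulated_annealing(n_muls=20, dsp_target=8)
print(20 - sum(assign))  # DSPs used by the best-found assignment
```

Note that every candidate evaluation in the real setting is a synthesis run (or a GPP prediction), which is why per-DFG heuristic search is so much slower than a single inference pass of a pre-trained agent.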
GPP vs. HLS tool and Pyramid [8].
GPP is evaluated on both synthetic and real-case DFGs. Fig. 5 compares GPP predictions with HLS synthesis reports regarding LUT, DSP, and critical path timing. For LUT utilization, the mean absolute percentage errors (MAPEs) of GPP are 9.2% on synthetic DFGs and 7.4% on real-case DFGs, whereas those of Vivado HLS are 122.4% and 92.2%, respectively. For CP timing, the MAPEs of GPP are 4.2% and 4.6% on synthetic and real-case DFGs, whereas the MAPEs of Vivado HLS are 7.7% and 42.1%. Overall, GPP reduces the prediction errors of HLS tools by more than 10x in resource and more than 5x in timing.
Pyramid [8] is also an ML-based framework for resource and timing prediction. The major difference between GPP and Pyramid is the features required for prediction. Pyramid needs 72 features from HLS reports as inputs, which requires running HLS to get VHDL designs, possibly consuming hours for large designs;
Figure 7: IronMan performance on real-case benchmarks. IronMan meets DSP constraints for 39 out of 40 cases, and meetsall 40 after fine-tuning, while SA, GA and Vivado HLS meet for 2, 6, and 0 cases, respectively. With . × and . × less DSPusage, IronMan after fine-tuning uses . and . less LUTs than SA and GA, respectively. Meanwhile, IronMan alwaysmaintains the shortest latency while Vivado HLS increases the latency by up to × . whereas GPP can make high-accuracy predictions simply from rawDFGs (within a second). Pyramid considers four ML models and anensemble of these four, none of which includes graphical structure.The reported results show that the averaged prediction error of asingle ML model is 17 .
8% for resource and 17 .
3% for timing, withthe ensemble reaching 5 .
5% for resource and 4 .
1% for timing.
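The MAPE metric quoted throughout this comparison can be computed directly from predicted and actual values; a minimal sketch (the function name and sample numbers are illustrative, not from the paper):

```python
def mape(y_true, y_pred):
    """Mean absolute percentage error, in percent."""
    assert len(y_true) == len(y_pred) and len(y_true) > 0
    return 100.0 * sum(abs((t - p) / t) for t, p in zip(y_true, y_pred)) / len(y_true)

# Example: actual vs. predicted LUT counts for two hypothetical designs,
# each off by 10% -> MAPE of 10.0.
print(mape([1000, 2000], [900, 2200]))  # -> 10.0
```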
RLMD vs. GA and SA.
Fig. 6 compares RLMD with GA and SA in terms of the Pareto solutions between LUTs and DSPs. RLMD outperforms GA and SA by a clear margin: given the same number of DSPs, RLMD finds solutions that reduce LUT utilization by 12.7% and 12.9%, respectively.

IronMan vs. All.
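The comparisons here rank candidate designs by Pareto dominance over two minimized objectives, DSP and LUT usage. A minimal sketch of such a two-objective dominance filter (illustrative only, not the paper's implementation):

```python
def pareto_front(points):
    """Keep the non-dominated (dsps, luts) pairs, both objectives minimized."""
    front, best_luts = [], float("inf")
    for dsps, luts in sorted(points):  # ascending DSPs, ties broken by LUTs
        if luts < best_luts:           # strictly fewer LUTs than all cheaper-DSP points
            front.append((dsps, luts))
            best_luts = luts
    return front

# Example: (DSPs, LUTs) candidates from a hypothetical DSE run.
print(pareto_front([(8, 1200), (8, 1100), (12, 900), (12, 950), (10, 1150)]))
# -> [(8, 1100), (12, 900)]
```

Dominated points such as (10, 1150) are dropped because another candidate is at least as good on both objectives; what remains is the LUT-vs-DSP trade-off curve plotted in Fig. 6.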
As depicted in Fig. 7, IronMan is fully evaluated against SA, GA and Vivado HLS on four real-case benchmarks, kernel_2mm, gemm, spmv, and kernel_adi, whose sizes are at least 2× larger than the synthetic DFGs. To show that IronMan can perfectly satisfy user specifications without sacrificing latency, we specify different DSP constraints ranging from 20% to 80% of the maximal number of DSPs for each case.

Among the four real-case benchmarks with 40 different DSP constraints in total, IronMan meets 39 (97.5%) of them, which further improves to 40 (100%) with fine-tuning, whereas SA, GA and Vivado HLS meet the constraints in only 2 (5%), 6 (15%) and 0 cases, respectively. Specifically, IronMan on average consumes 98.5% of the targeted DSPs (improved to 99.3% with fine-tuning), whereas the solutions found by SA, GA and Vivado HLS use 1.29×, 1.43×, and 2.54× more DSPs than the target, respectively. Not only does IronMan meet user-specified constraints far more accurately than its counterparts and HLS tools, it also always maintains the shortest latency, whereas Vivado HLS increases latency by up to 6×.

It is noteworthy that reducing DSPs without sacrificing latency is achieved at the cost of increased LUT usage. This trade-off is acceptable because DSPs are usually the more critical resource on FPGAs, while LUTs are relatively abundant. While perfectly satisfying DSP constraints (with 1.29× and 1.43× fewer DSPs), IronMan increases LUT usage by only 1.2% and 3.0% compared with SA and GA; with further fine-tuning, IronMan achieves an additional 8.9% LUT reduction, resulting in 7.7% and 5.9% lower LUT usage than SA and GA.

Execution Time.
During inference, i.e., when applied to real applications, IronMan takes only a few seconds (up to 10 s) to produce resource allocation decisions, together with accurate resource and timing predictions. In contrast, Vivado HLS takes tens of minutes to synthesize the C++ code, and up to hours to obtain the exact resource usage after implementation, which is extremely time-consuming for large applications. SA and GA also take hours on average, because they struggle to meet the DSP constraint exactly or even closely, and cannot generalize across different DFGs or DSP constraints. Fine-tuning the RL agent is optional: it provides the flexibility to trade a quick solution from the pre-trained model against a longer-running yet better one for a particular DFG, and the number of fine-tuning episodes is adjustable based on users' requirements.

CONCLUSION
IronMan is an end-to-end framework that aims to help HLS tools generate higher-quality solutions under user-specified constraints, and to perform more flexible DSEs that provide Pareto solutions not currently supported by HLS tools. IronMan is equipped with a GNN-based performance predictor, GPP, an RL-based DSE engine, RLMD, and a code transformer, CT. Independently, GPP achieves high prediction accuracy, reducing the prediction errors of HLS tools by 5.7× in timing and 10.9× in resource; RLMD obtains Pareto solutions that outperform GA and SA by 12.7% and 12.9%, respectively. Integrated, IronMan finds solutions with 2.54× fewer DSPs and up to 6× shorter latency than those of HLS tools, and is up to 400× faster than the heuristic algorithms and HLS tools.

REFERENCES
[3] Steve Dai et al. 2018. Fast and accurate estimation of quality of results in high-level synthesis with machine learning. In FCCM.
[4] Will Hamilton et al. 2017. Inductive representation learning on large graphs. In NeurIPS.
[5] Yuko Hara et al. 2009. Proposal and quantitative analysis of the CHStone benchmark program suite for practical C-based high-level synthesis. JIP 17 (2009), 242–254.
[6] Thomas N Kipf and Max Welling. 2017. Semi-supervised classification with graph convolutional networks. ICLR (2017).
[7] Johannes de Fine Licht et al. 2018. Transformations of High-Level Synthesis Codes for High-Performance Computing. arXiv:1805.08288 (2018).
[8] Hosein Mohammadi Makrani et al. 2019. Pyramid: Machine Learning Framework to Estimate the Optimal Timing and Resource Usage of a High-Level Synthesis Design. In .
[9] Louis-Noël Pouchet and Tomofumi Yuki. 2016. PolyBench/C - the Polyhedral Benchmark suite. http://web.cs.ucla.edu/~pouchet/software/polybench/.
[10] Brandon Reagen et al. 2014. MachSuite: Benchmarks for Accelerator Design and Customized Architectures. In IISWC.
[11] Franco Scarselli, Marco Gori, Ah Chung Tsoi, Markus Hagenbuchner, and Gabriele Monfardini. 2008. The graph neural network model. IEEE Transactions on Neural Networks 20, 1 (2008), 61–80.
[12] Benjamin Carrion Schafer and Zi Wang. 2019. High-level synthesis design space exploration: Past, present and future. IEEE TCAD (2019).
[13] Richard S Sutton and Andrew G Barto. 2018. Reinforcement Learning: An Introduction. MIT Press.
[14] Peter JM Van Laarhoven and Emile HL Aarts. 1987. Simulated annealing. In Simulated Annealing: Theory and Applications. Springer, 7–15.
[15] Vivado. Accessed: 2021-02-14. Vivado Design Suite - HLx Editions.
[16] Darrell Whitley. 1994. A genetic algorithm tutorial. Statistics and Computing 4, 2 (1994), 65–85.
[17] Xilinx. Accessed: 2021-02-14. Xilinx Vivado High-Level Synthesis.
[18] In ICCAD.
[19] Jieru Zhao et al. 2019. Machine learning based routing congestion prediction in FPGA high-level synthesis. In DATE.
[20] Guanwen Zhong et al. 2016. Lin-analyzer: a high-level performance analysis tool for FPGA-based accelerators. In .
[21] Wei Zuo et al. 2013. Improving high level synthesis optimization opportunity through polyhedral transformations. In FPGA.