Characterizing and Optimizing EDA Flows for the Cloud
Abdelrahman Hosny, Department of Computer Science, Brown University, abdelrahman [email protected]
Sherief Reda, School of Engineering, Brown University, sherief [email protected]
Abstract— Cloud computing accelerates design space exploration in logic synthesis and parameter tuning in physical design. However, deploying EDA jobs on the cloud requires EDA teams to deeply understand the characteristics of their jobs in cloud environments. Unfortunately, there has been little to no public information on these characteristics. Thus, in this paper, we formulate the problem of migrating EDA jobs to the cloud. First, we characterize the performance of four main EDA applications, namely: synthesis, placement, routing and static timing analysis. We show that different EDA jobs require different machine configurations. Second, using observations from our characterization, we propose a novel model based on Graph Convolutional Networks to predict the total runtime of a given application on different machine configurations. Our model achieves a prediction accuracy of 87%. Third, we develop a new formulation for optimizing cloud deployments in order to reduce deployment costs while meeting deadline constraints. We present a pseudo-polynomial optimal solution using a multi-choice knapsack mapping that reduces costs by 35.29%.
I. INTRODUCTION
EDA (Electronic Design Automation) tools expose hundreds of parameters to tune for front-end and back-end design. Therefore, design space exploration and efficient physical implementation have become increasingly challenging and require a massive amount of compute to achieve acceptable Quality of Results (QoR). In recent years, there has been a growing trend among EDA teams to utilize elastic compute environments (i.e., the cloud) to gain near-instant access to compute resources [1]. Migrating EDA jobs to the cloud has helped teams meet the demands of their tapeout schedules, hence reducing the time to market [2]. For example, horizontal scaling by launching more compute servers allows EDA teams to complete a highly parallelizable compute job, such as simulation, in less time. In addition, EDA teams have the flexibility to choose the configuration of hardware that meets the needs of the exact pending job, and only for the time needed to complete it.

However, migrating EDA jobs to the cloud is not a straightforward path, especially for teams with little or no experience managing elastic resources. For example, design teams need to choose the right machine configuration that achieves the best performance for their job. While simulation and verification are known to be embarrassingly parallel (i.e., directly benefiting from the scale of the cloud), the compute requirements for the synthesis and physical design stages are not well-studied, especially in multi-tenancy environments.
This work is partially supported by NSF grant 1814920.
[Fig. 1 diagram: Inputs (designs, tools, technology node) feed Characterization (Sec. III.A), which feeds Prediction (Sec. III.B, runtime + speedup); together with cloud instance types and a $ cost calculator, these feed Optimization (Sec. III.C). Outputs: an EDA job cloud instance that meets the tapeout schedule at minimum $ cost.]
Fig. 1: Workflow of optimizing EDA cloud deployments

Furthermore, reducing deployment costs while meeting the tapeout schedule is a challenge, especially for teams with a limited budget.

Our main contributions are:
1) We characterize the performance of four key EDA applications (synthesis, placement, routing, and static timing analysis) under different machine configurations. Using our observations, we present recommendations for the configurations of cloud instances to provision for each application.
2) We propose a novel model based on Graph Convolutional Networks (GCNs) that predicts the total runtime a given job would take using certain machine configurations. Our model achieves a high prediction accuracy of 87%.
3) With recommended machine configurations and runtime predictions for each application, we reduce cloud deployment costs subject to deadline constraints by mapping the problem to the classical multi-choice knapsack problem (NP-hard). Our implementation recommends optimal machine configurations that minimize the total cloud deployment cost by an average of 35.29%.

Next, we give a brief background in Section II. In Section III, we formulate the problem and discuss our proposed framework. In Section IV, we present our experimental results.

II. BACKGROUND
Cloud computing refers to the elastic compute resources that can be provisioned, scaled up, or shut down on demand. Cloud providers virtualize their physical infrastructure to share processing time, memory, storage, and network bandwidth among many users (known as tenants). In order to achieve this virtualization, cloud vendors use specialized software called the Hypervisor. The Hypervisor isolates each tenant's resources in a Virtual Machine (VM) that is accessible only by its owner. In a standard cloud offering, VMs are sold in units of: (i) vCPU: a virtual CPU that is seen as a single CPU thread; (ii) Memory: a fixed number of memory pages solely reserved for the use of a VM, expressed as the total memory size reserved; and (iii) Storage: the size and type of the underlying storage device partition mounted on the VM.

[Fig. 2: Performance characterization of four representative EDA jobs (Synthesis, Placement, Routing, STA): (a) Branch Misses (%), (b) Cache Misses (%), (c) Floating-point AVX Operations (%), (d) Total Runtime (minutes).]

III. OPTIMIZING EDA CLOUD DEPLOYMENTS
Problem Definition.
A fundamental question that faces EDA teams when migrating their EDA jobs to the cloud is: what configurations of VMs should be provisioned for each job? And how can the job completion time be reduced while minimizing the cost? In order to answer these questions, Figure 1 shows the workflow that we propose in this paper. Specifically, we introduce the following problems:
Problem 1.
What is the right VM configuration for a given EDA job? To address this problem, we characterize four main EDA applications, namely: synthesis, placement, routing, and static timing analysis. We focus on characteristics that are intrinsic to the EDA job and that affect the completion time.
Problem 2:
Given a design (in RTL or netlist form), predict the runtime for a given job (e.g., synthesis or routing) when using 1, 2, 4, and 8 vCPUs. Our proposed prediction model learns internal graph features of the design that affect the total runtime of a given job on different machine sizes.
Problem 3:
Given the runtime for each job under 1, 2, 4, and 8 vCPUs, as well as a deadline constraint, select a machine size for each job such that the deadline is met and the total cost is minimized. We address this problem using a mapping to the multi-choice knapsack problem and implement an optimal solution using dynamic programming.
A. EDA Flow Characterization
To address Problem 1, we characterized the four jobs using a major commercial EDA flow and a SPARC core design from the OpenPiton design benchmark [3], using a 14nm technology node. We then collected the execution data from the system's hardware performance counters for further analysis. In order to simulate a multi-tenancy environment, we used Linux Control Groups on a machine with a 14-core Intel Xeon E5-2680 processor running at 3.3GHz and 128GB of DDR4 memory. We used the Linux perf utility to instrument the hardware performance counters.
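As a concrete illustration of how raw counter data translates into the percentages plotted in Fig. 2, the sketch below computes branch-miss and cache-miss rates from perf-style readings. The counter values here are made-up placeholders, not measurements from our experiments.

```python
# Illustrative only: compute the miss-rate percentages plotted in Fig. 2
# from raw hardware counter readings. The values below are hypothetical.

def miss_rate(misses, total):
    """Percentage of missed events relative to all events of that type."""
    return 100.0 * misses / total

# Hypothetical `perf stat` readings for a single routing run:
counters = {
    "branches": 1_200_000_000,
    "branch-misses": 26_400_000,
    "cache-references": 900_000_000,
    "cache-misses": 243_900_000,
}

branch_miss_pct = miss_rate(counters["branch-misses"], counters["branches"])
cache_miss_pct = miss_rate(counters["cache-misses"], counters["cache-references"])
print(f"branch misses: {branch_miss_pct:.2f}%  cache misses: {cache_miss_pct:.2f}%")
```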
Branch Prediction.
Figure 2-a summarizes our findings from the characterization experiments. First, we observe that routing has a higher percentage of branch misses. We attribute this to the nature of routing algorithms, where there can be a few trials before a net is successfully routed with no design rule violations. In particular, graph search algorithms in the routing step encompass a large portion of conditional statements that cannot be avoided. Rip-up and reroute techniques also contribute to halting the continuous execution of the routing algorithms.

[Fig. 3: Routing speedup (x) for different designs: dyn_node, aes, ibex, jpeg, swerv, ariane, coyote, sparc_core. dyn_node is the smallest and sparc_core is the largest.]
Memory Access Patterns.
In Figure 2-b, we observe that placement and routing have significantly higher cache misses than synthesis and STA. Placement has a 45.11% cache miss rate when using 1 vCPU and 33.84% when using 8 vCPUs, while routing has 27.15% and 29.84% cache miss rates using 1 and 8 vCPUs, respectively. We attribute this higher miss rate to the nature of the analytical component in the placement engine, which tries to optimize the wirelength across all the chip instances using convex optimization methods. This requires access to large vectors to calculate the gradients, hence benefiting from the additional cache available with more vCPUs.
Floating-point Operations.
In Figure 2-c, we observe that the placement job requires more floating-point operations that run on Advanced Vector Extensions (AVX) hardware. This can be attributed to the analytical engine that tries to optimize the wirelength across all the chip area using convex optimization methods, which involves calculating gradients that rely on floating-point operations. The STA job comes next in its percentage usage of the AVX hardware. This is consistent with the nature of STA algorithms, where calculating slacks involves graph traversal from inputs to outputs, with access to floating-point values in the technology library.
Scalability and Speedup.
In Figure 2-d, we observe that the routing job scales well with more vCPUs. As Figure 3 shows, however, the speedup depends on the design size: large designs scale well with the number of vCPUs allocated, while for small designs the speedup is capped at a certain point.
[Fig. 4 diagram: an AIG/netlist graph (adjacency matrix Adj over cells, feature matrix X) feeds graph convolution layers that produce a graph embedding, followed by a fully connected NN.]

Fig. 4: Our proposed runtime prediction model
Main Takeaways.
From the point of view of EDA teams running their EDA applications on the cloud, we summarize our main recommendations:
1) Synthesis and STA jobs perform well on general-purpose VM instances with a balance between computation and memory access. Placement and routing require VM instances with a higher memory-to-core ratio, with routing demanding more available L1 and LLC cache.
2) Placement jobs should be run on a compute instance with an underlying processor that supports Advanced Vector Extensions (AVX). STA jobs would also benefit from AVX hardware.
3) On large designs, routing jobs scale well with the number of vCPUs allocated. However, on small designs, the speedup is capped at a certain point.
These observations motivate our work in the next section.
B. Runtime Prediction
To address Problem 2, we note that the runtime of chip design tasks depends on a number of factors, such as the design itself, the tools used, the technology node, the parameters used to instruct the tools, and the VM configuration. Without loss of generality, when using the same tools, technology node, default parameters, and VM configuration, the runtime of a certain job depends on the complexity of the design itself. Figure 4 shows the architecture of our model. The model takes as input the design in RTL or netlist form and performs an embedding operation using Graph Convolutional Networks [4]. After that, a fully-connected neural layer transforms the embedding into predictions for the runtime under different machine sizes (i.e., 1, 2, 4, and 8 vCPUs).
Processing Input Design.
When building a model to predict synthesis runtime, the input is usually RTL, which is not a graph. However, synthesis tools map the RTL into an intermediate representation, such as And-Inverter Graphs (AIGs), before synthesizing and mapping to a technology library. Therefore, our model can operate on the AIG representation of the design. The AIG is a Directed Acyclic Graph (DAG), which means it preserves edge directions for the GCN. On the other hand, when building a prediction model for placement and routing, the input is expected to be a netlist. In order to operate on the given netlist, we convert cells and I/O pins into graph nodes. In addition, we convert each net into a set of directed edges using the well-known star model, where there is one edge from the driving cell (or the input pin) towards each of the sinks (or the output pins).
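A minimal sketch of this star-model conversion is given below. The tuple-based netlist representation and the names are our own illustration; real netlist parsing is tool-specific.

```python
# Star model: each net contributes one directed edge from its driver
# (a cell output or a primary input pin) to every sink (a cell input
# or a primary output pin). The netlist format here is a toy stand-in.

def nets_to_star_edges(nets):
    """nets: iterable of (driver, sinks) pairs -> list of directed edges (u, v)."""
    edges = []
    for driver, sinks in nets:
        for sink in sinks:
            edges.append((driver, sink))
    return edges

# Toy netlist: primary input 'in0' drives cells u1 and u2; u1 drives 'out0'.
toy_nets = [("in0", ["u1", "u2"]), ("u1", ["out0"])]
print(nets_to_star_edges(toy_nets))  # [('in0', 'u1'), ('in0', 'u2'), ('u1', 'out0')]
```

A net with one driver and k sinks thus yields exactly k edges, keeping the graph size linear in the number of pins.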
Graph Convolutions.
In Graph Convolutional Networks (GCNs), the key idea is to generate node embeddings based on local neighborhoods. The first layer's embedding of a node represents its input feature vector, $x_i$. Nodes aggregate information from their neighbors in each convolutional layer. This aggregation is followed by an activation function, such as ReLU, and a pooling operation, such as sum-pooling. With that in place, every layer can be written as a non-linear function:

$$H^{(l+1)} = f\left(H^{(l)}, A\right), \qquad (1)$$

where $H^{(l)}$ represents the activation at layer $l$, $H^{(0)}$ is the input feature matrix $X$, and $A$ is the adjacency matrix. Looking at the embedding of each node, we can elaborate on Equation 1 as follows:

$$h_v^{(k)} = \sigma\!\left( W_k \sum_{u \in N(v)} \frac{h_u^{(k-1)}}{|N(v)|} + B_k\, h_v^{(k-1)} \right), \quad \forall k \in \{1, \ldots, K\},$$

where $h_v^{(k)}$ is a vector that represents the $k$-th layer embedding of node $v$, $N(v)$ represents the neighbors of node $v$, $W_k$ and $B_k$ are the trainable matrices shared by all nodes of the graph, and $\sigma$ is the activation function. After $K$ layers of neighborhood aggregation, we get output embeddings for each node that can be fed into a loss function. We can then run stochastic gradient descent to learn $W_k$ and $B_k$.

Model Design.
We used 2 GCN layers with 256 and 128 hidden units respectively, followed by 1 fully connected layer with 128 units. The model is trained for 200 epochs using Mean Square Error (MSE) as the loss function and Adam as the optimizer (lr=1e-4). The loss function calculates the combined prediction error for all four runtimes (i.e., 1, 2, 4, and 8 vCPUs).
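The per-node update rule and the architecture above can be sketched with plain NumPy as follows. This is an illustrative forward pass with random stand-in weights and toy dimensions, not the trained model (the actual hidden sizes are 256, 128, and 128).

```python
# Illustrative forward pass: two mean-aggregation GCN layers (the h_v update
# rule), sum-pooling into a graph embedding, then a linear readout that
# emits four runtime predictions (for 1, 2, 4, and 8 vCPUs).
import numpy as np

def gcn_layer(H, A, W, B):
    """All-nodes form of the update: ReLU of W-transformed neighbor mean
    plus B-transformed self embedding."""
    deg = np.maximum(A.sum(axis=1, keepdims=True), 1)     # |N(v)|; guard isolated nodes
    return np.maximum(0.0, ((A @ H) / deg) @ W + H @ B)   # ReLU activation

rng = np.random.default_rng(0)
n, d_in, d1, d2 = 5, 4, 16, 8                 # toy sizes for illustration
A = (rng.random((n, n)) < 0.4).astype(float)  # random 0/1 adjacency matrix
np.fill_diagonal(A, 0)
X = rng.standard_normal((n, d_in))            # input feature matrix H^(0) = X

H1 = gcn_layer(X, A, rng.standard_normal((d_in, d1)), rng.standard_normal((d_in, d1)))
H2 = gcn_layer(H1, A, rng.standard_normal((d1, d2)), rng.standard_normal((d1, d2)))
graph_embedding = H2.sum(axis=0)              # sum-pooling over nodes
runtimes = graph_embedding @ rng.standard_normal((d2, 4))  # 4 runtime outputs
```

Row-vector embeddings make the right-multiplication by $W$ and $B$ equivalent to the left-multiplication in the equation above.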
C. Cloud Deployment Optimization
Given the runtime estimates, we now address Problem 3. Our proposed solution maps the problem to the multi-choice knapsack problem (MCKP) [5]. Using our predictions calculated in the previous section, each job can be run on a different machine configuration (i.e., 1, 2, 4, or 8 vCPUs), where a chosen configuration takes time $t$ and costs $p$ in total.

Formulation.
Let $z_l(C)$ be an optimal solution defined on $l$ applications and with total runtime constraint $C$:

$$z_l(C) := \min \sum_{i=1}^{l} \sum_{j=1}^{N_i} s_{ij}\, p_{ij} \qquad (2)$$

such that

$$\sum_{i=1}^{l} \sum_{j=1}^{N_i} s_{ij}\, t_{ij} \le C,$$
$$\sum_{j \in N_i} s_{ij} = 1, \quad i = 1, \ldots, l,$$
$$s_{ij} \in \{0, 1\}, \quad i = 1, \ldots, l,\; j \in N_i,$$

where $s_{ij} \in \{0,1\}$ denotes whether we select VM configuration $j$ for stage $i$ or not, and $N_i$ denotes the number of configurations in a given stage. $t_{ij}$ denotes the runtime of stage $i$ when using configuration $j$, which is obtained from the runtime predictions. Similarly, $p_{ij}$ denotes the cost of running stage $i$ when using configuration $j$, which is obtained from the pricing table of the selected cloud vendor. We assume that $z_l(C) := \infty$ if no solution exists (i.e., the total runtime constraint is not sufficient to complete all the stages even using the fastest machine configuration).
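A dynamic-programming sketch of this multi-choice knapsack mapping is shown below, minimizing total cost subject to the runtime budget with runtimes in whole seconds. The per-job options are hypothetical values for illustration, not our measured runtimes or AWS prices.

```python
# Pseudo-polynomial DP for the multi-choice knapsack mapping: every stage
# (EDA job) must pick exactly one VM configuration; minimize total cost
# while keeping total runtime within the budget C (in seconds).
INF = float("inf")

def mckp_min_cost(stages, C):
    """stages: list of option lists, each option a (runtime_sec, cost) pair.
    Returns (min_cost, chosen option index per stage), or (inf, None)."""
    # dp[c] = cheapest way to run the stages processed so far in <= c seconds
    dp = [(0.0, [])] * (C + 1)
    for options in stages:
        nxt = [(INF, None)] * (C + 1)
        for c in range(C + 1):
            for j, (t, p) in enumerate(options):
                # infeasible predecessors have cost INF and are skipped here
                if t <= c and dp[c - t][0] + p < nxt[c][0]:
                    nxt[c] = (dp[c - t][0] + p, dp[c - t][1] + [j])
        dp = nxt
    return dp[C]

# Hypothetical per-job options: (runtime in seconds, cost in $)
jobs = [
    [(100, 5.0), (60, 8.0)],   # e.g. synthesis on a slow vs. fast instance
    [(200, 3.0), (90, 7.0)],   # e.g. routing on a slow vs. fast instance
]
print(mckp_min_cost(jobs, 200))  # cheapest feasible pick within 200 s
```

The table has $C+1$ entries and each stage scans all of its options, giving $O(l \cdot C \cdot \max_i N_i)$ time, pseudo-polynomial in the budget $C$.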
[Fig. 5: Histogram of runtime prediction errors (predicted - actual, in seconds) over test set instances, centered around zero error.]

Fig. 5: Runtime prediction errors. Avg. error: 13%.
[Table I data: recommended VM type per task: Synthesis, general-purpose VM; Placement, memory-optimized VM; Routing, memory-optimized VM; STA, general-purpose VM. For each task, the table lists vCPUs, Runtime (sec.), and Cost ($) per configuration, together with the Total Runtime and Min Cost ($) under each Total Runtime Constraint.]
TABLE I: Minimizing total cloud deployment cost subject to a time constraint. The mark (x) denotes the recommended machine configuration. NA denotes Not Achievable.

To solve (2), we implemented a pseudo-polynomial solution through dynamic programming, utilizing the approach of Dudzinski and Walukiewicz [6]:

$$z_l(C) = \min_{1 \le j \le N_l,\; t_{lj} \le C} \left\{ z_{l-1}\left(C - t_{lj}\right) + p_{lj} \right\}.$$

This implementation provides an optimal solution provided that the runtime values are rounded to the nearest integer (second). This is an assumption that we can safely make in our case, since cloud machines are billed per second (no fractions).

IV. EXPERIMENTAL RESULTS
We demonstrate our predictions using the GF 14nm technology node and a major commercial EDA flow.
Dataset.
We use 18 representative benchmarks of different sizes and structures from the EPFL benchmark suite [7] and OpenCores [8]. We synthesize each benchmark applying different logic optimizations to generate different netlists. The motivation is to challenge the GCN with netlists that have different physical structures but perform the same logic function. We have a total of 330 unique netlists, with 2,640 data points (runtimes) for different machine configurations. These designs range from a few hundred instances to 200k instances. We divide the dataset into training and test groups with 80% and 20% of the data respectively, where the netlists of the test set belong to designs unseen in the training set.
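The design-level split described above (variants of the same design never straddle train and test) can be sketched as follows. The `split_by_design` helper and the sample format are our own illustration, not the paper's tooling.

```python
# Group-aware 80/20 split: all netlist variants of a design stay on the same
# side, so every test netlist comes from a design unseen during training.
import random

def split_by_design(samples, test_frac=0.2, seed=0):
    """samples: list of (design_name, netlist) pairs -> (train, test) lists."""
    designs = sorted({design for design, _ in samples})
    random.Random(seed).shuffle(designs)
    n_test = max(1, int(len(designs) * test_frac))
    test_designs = set(designs[:n_test])
    train = [s for s in samples if s[0] not in test_designs]
    test = [s for s in samples if s[0] in test_designs]
    return train, test

# Toy dataset: two synthesis variants of 'aes', one each of three other designs.
samples = [("aes", "aes_v1"), ("aes", "aes_v2"), ("ibex", "ibex_v1"),
           ("jpeg", "jpeg_v1"), ("swerv", "swerv_v1")]
train, test = split_by_design(samples)
```

Splitting on design names rather than on individual netlists is what prevents the model from being evaluated on near-duplicates of its training data.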
Prediction Accuracy.
Due to lack of space, we show a histogram of model prediction errors for placement and routing in Figure 5. Runtime prediction given a netlist (placement, routing, STA) achieves an average error of 13%. On AIGs (synthesis), the runtime prediction has an average error of 5%.
Optimization Results.
Referring to Figure 1, our optimization module takes as input the predicted runtime for a given EDA job on a certain machine configuration (Section III-B) and the cost of running the job on the machine type recommended for that job (Section III-A). In order to calculate the cost, we obtained the pricing table for the machine configurations from AWS at the time of this writeup, and calculated the total cost (in USD) for each EDA job (cost = runtime in hours × cost per hour). To demonstrate our optimization, we applied different runtime constraints on the predictions of the sparc_core design, as shown in Table I. Our algorithm outputs the recommended machine configuration for each job that minimizes the total cost subject to the given total runtime constraint. As we tighten the time constraint, we observe that the algorithm chooses higher machine configurations for some tasks (but not all). A very tight time constraint cannot be met, and no solution is presented.

[Fig. 6 panels: Cost ($) and Total Runtime (mins.) for sparc_core, coyote, ariane, and swerv under over-provisioning, our optimization, and under-provisioning.]

Fig. 6: Cost savings from running our multi-choice knapsack optimization algorithm. Over-provisioning runs all stages on 8 vCPUs. Under-provisioning runs all stages on 1 vCPU.

Figure 6 shows the cost savings that we get from running our optimization as compared to over-provisioning (using 8 vCPUs for all jobs) or under-provisioning (using 1 vCPU for all jobs). It offers an average of 35.29% cost savings with minimal overhead relative to the best runtime.

V. CONCLUSIONS
We present an end-to-end workflow for a cost-efficient deployment of EDA workloads on the cloud. Our method saves 35.29% of the costs while meeting scheduled deadlines. The code is open-source under a permissive license (BSD-3) and is available publicly on GitHub.

REFERENCES
[1] V. Kamath, R. Giri, and R. Muralidhar, "Experiences with a private enterprise cloud: Providing fault tolerance and high availability for interactive EDA applications," 2013, pp. 770–777.
[2] N. Sehgal, J. M. Acken, and S. Sohoni, "Is the EDA industry ready for cloud computing?" IETE Technical Review, vol. 33, no. 4, pp. 345–356, 2016.
[3] J. Balkind, M. McKeown, Y. Fu, T. Nguyen, Y. Zhou, A. Lavrov, M. Shahrad, A. Fuchs, S. Payne, X. Liang et al., "OpenPiton: An open source manycore research framework," ACM SIGPLAN Notices, vol. 51, no. 4, pp. 217–232, 2016.
[4] T. N. Kipf and M. Welling, "Semi-supervised classification with graph convolutional networks," arXiv preprint arXiv:1609.02907, 2016.
[5] H. Kellerer, U. Pferschy, and D. Pisinger, The Multiple-Choice Knapsack Problem. Berlin, Heidelberg: Springer Berlin Heidelberg, 2004, pp. 317–347. [Online]. Available: https://doi.org/10.1007/978-3-540-24777-7_11
[6] K. Dudziński and S. Walukiewicz, "Exact methods for the knapsack problem and its generalizations," European Journal of Operational Research, vol. 28, no. 1, pp. 3–21, 1987.
[7] L. Amarù, P.-E. Gaillardon, and G. De Micheli, "The EPFL combinational benchmark suite," in Proceedings of the 24th International Workshop on Logic & Synthesis (IWLS), 2015.
[8] "OpenCores." [Online]. Available: https://opencores.org/