Predicting Memory Compiler Performance Outputs using Feed-Forward Neural Networks
Felix Last, Max Haeberlein, and Ulf Schlichtmann
Department of Electrical and Computer Engineering, Technical University of Munich
Intel Deutschland GmbH, Neubiberg, Germany
Abstract

Typical semiconductor chips include thousands of mostly small memories. As memories contribute an estimated 25% to 40% to the overall power, performance, and area (PPA) of a product, memories must be designed carefully to meet the system's requirements. Memory arrays are highly uniform and can be described by approximately 10 parameters depending mostly on the complexity of the periphery. Thus, to improve PPA utilization, memories are typically generated by memory compilers. A key task in the design flow of a chip is to find optimal memory compiler parametrizations which on the one hand fulfill system requirements while on the other hand optimize PPA. Although most compiler vendors also provide optimizers for this task, these are often slow or inaccurate. To enable efficient optimization in spite of long compiler run times, we propose training fully connected feed-forward neural networks to predict PPA outputs given a memory compiler parametrization. Using an exhaustive search-based optimizer framework which obtains neural network predictions, PPA-optimal parametrizations are found within seconds after chip designers have specified their requirements. Average model prediction errors of less than 3%, a decision reliability of over 99%, and productive usage of the optimizer for successful, large-volume chip design projects illustrate the effectiveness of the approach.

1 Introduction
For the past five decades, the electronic chip industry has persistently delivered the exponential frequency and power improvements initially forecast by Gordon Moore in the 1960s. However, maintaining steady improvements at such an ever-increasing pace becomes more and more challenging on today's advanced technology nodes, where new physical phenomena and limitations in lithography resolution prohibit mere continuation of current practice. Instead, power, performance, and area (PPA) trend continuation at aggressively scaled sub-20-nm nodes requires substantial innovation in electronic design automation (EDA) tools.

Owing to their sheer count on today's chips, memories such as static random access memories (SRAMs), read-only memories (ROMs), and register files (RFs) have significant impact on the PPA of modern integrated circuits (ICs). Because different requirements apply to each memory, such as bit size and the number of I/O ports, there is no single blanket solution for optimal memory design. Instead, each memory instance must be fine-tuned separately in order to achieve overall product PPA targets. Aggravatingly, focus often shifts during the design process, with initial priority given to minimal area until power concerns begin to dominate later design cycles. Consequently, techniques are needed to select optimal memory designs for given requirements, serving the short turnaround times and flexibility demands of an increasingly dynamic design environment.

Memory compilers are the most common choice for generating memories in modern IC design flows (Guthaus et al., 2016). These tools, which are typically provided by the manufacturing foundry or external vendors, provide parametrizable libraries of memories which are verified and characterized for the respective technology node. Given parameters such as the number of words, the word width, as well as a multitude of architectural parameters, a memory compiler returns the netlists, RTL codes, and other artifacts required both for the IC design flow and eventually for manufacturing. Moreover, the compiler produces accurate PPA estimates for the memories. However, with a growing number of architectural parameters which target the improvement of various PPA aspects, the complexity of memory compilers also increases significantly.

The vast number of options chip designers have to configure makes it difficult to determine the settings most suitable for the overall product's PPA targets. This is especially true since most of the compiler input parameters are architectural parameters, whose values cannot be derived directly from system requirements, yet have significant effects on PPA.

Interactions with system parameters make the task of selecting architectural parameters significantly more complex. To name one example, a large macro may have to be split into multiple banks along the wordline in order to achieve frequency requirements, whereas unnecessary banking beyond matching frequency should be avoided to keep area minimal. Therefore, the number of banks, an architectural parameter, must be carefully tuned for each individual memory in accordance with its size and target frequency.

Compiler run times of more than 30 minutes prohibit manual or automated trial-and-error search of optimal compiler input parameters. Moreover, for a given set of system parameters, multiple compilers may be available on a single technology node, further and significantly expanding the search space.
Lastly, during the architecture exploration phase, where different technology nodes are compared, achievable PPA must be estimated for all applicable compilers, even across vendors and process technologies.

In practice today, no standard solution to optimizing memory compiler input parameters has been established. Instead, parameter selection is often done by experts with significant experience of working with the respective compilers. However, expert knowledge is challenged by rapid compiler update cycles and a wide range of available technology nodes, limiting the applicability of previously learnt heuristics.

Chip designers and compiler experts also frequently rely on optimization tools shipped by compiler vendors. Aside from the clear limitation of these tools' applicability to the respective technology node, which obstructs comparison across nodes and vendors, their accuracy and run times are often not satisfactory, a result of oversimplified models or dependence on exhaustive compiler executions. IC design teams are further forced to rely on the vendor to provide an optimizer and to keep it up to date, while they are granted only limited insight into the reliability of its results.
1.1 Related Work

While, to the best of our knowledge, no published literature exists on finding memory compiler input parameters which optimize PPA, several optimization problems in the EDA space exhibit similar characteristics. The generalized goal of such optimization problems is to find optimal parameters of electrical components with regard to some target dimension such as PPA or yield. The limiting factor of such optimization is usually a costly evaluation of solutions, which in our case involves memory compiler execution and otherwise mostly requires expensive simulations.

Wang et al. (2018) distinguish three approaches for yield optimization: Monte Carlo-based, corner-based, and model-based optimization. As Monte Carlo-based techniques require an extensive amount of solution evaluations, most published work focuses on extensions which reduce the total number of required evaluations or simulations. Corner-based approaches, on the other hand, focus on finding parameters which optimize worst-case performance, which requires significantly fewer evaluations, but may lead to over-design and consequently unused potential with regard to the optimization target. Lastly, model-based approaches substitute the simulator (or, for our task, the memory compiler) during optimization and enable cheap evaluation of solutions.

Yao et al. (2015) maximize SRAM yield using a genetic algorithm which evaluates solutions using either a circuit simulation or a surrogate model trained on the fly, depending on an importance heuristic computed for each solution. Similarly, Wang et al. (2018) combine a model-based and a Monte Carlo-based approach for SRAM yield optimization, exploiting Bayesian uncertainty estimates to simulate only those solutions where both uncertainty and estimated yield are high, while using an incrementally fitted model for other solutions.

Memory compiler vendors often provide optimizers to find optimal architectural parameters for given system parameters. A major shortcoming of these solutions is a lack of transparency, as neither accuracy nor methodology are published or shared with customers. The issue is magnified by poor accuracy owing to oversimplifications in model-based approaches such as linear models. Other solutions typically rely on linear interpolation of the PPA of similar parametrizations in a large database which is expensive to construct and maintain; such methods are more closely related to Monte Carlo-based methods. Finally, the use of vendor optimization tools requires system architects who wish to compare technology nodes to switch between several user interfaces which may provide PPA outputs in different units or display different dimensions altogether.

Our proposed approach falls in the model-based category, which provides the advantage of fast solution evaluation without expensive simulations or compiler runs. Model-based approaches therefore support many or even exhaustive solution evaluations during an optimization run. However, according to Wang et al. (2018), model-based approaches involve a trade-off in terms of accuracy while requiring a large number of a-priori simulations to generate training data for model fitting. We argue that the latter drawback is acceptable when trained models are used not for a single optimization task, but for multiple designs and in multiple stages of the design process. Moreover, sufficient accuracy can be guaranteed by thorough and unbiased model evaluation.
Lastly, as explained in detail in Section 3.3, model accuracy does not need to be perfect as long as the solutions selected based on model estimates are optimal. A single, final simulation (or compiler run) of the selected solution can ensure a reliable estimate of the target dimension (e.g., PPA or yield).
In order to tackle the challenge of finding memory compiler input parameters which optimize PPA for a given memory, we propose fitting behavioral models of memory compilers. Realized as feed-forward neural networks, such models are capable of accurately predicting the PPA of a given compiler parametrization in extremely short time, thereby enabling efficient search of compiler options for given design requirements. This approach not only automates the memory selection and parametrization process, significantly reducing its complexity, but also improves PPA results by enabling a much broader set of possible parametrizations to be evaluated. Applicable independently of vendor, the approach further facilitates rapid PPA estimation and comparison of compiler vendors and technology nodes.
This work is structured as follows: In Section 2 we detail the proposed solution. Herein, the optimizer framework is discussed separately from the behavioral models, which are described in terms of data structure, neural network architecture, and training procedure. We proceed to evaluate our approach in Section 3, dividing the analysis into model evaluation on the one hand and the resulting decision quality assessment on the other. Finally, we summarize our findings and provide an outlook on future work in Section 4.
2 Proposed Solution

For the challenging task of finding optimal memory compiler input parameters for given system parameters, we propose a two-part solution: an optimizer framework which searches the parameter space, and neural networks trained as behavioral models of the memory compilers, which predict the PPA of potential solutions. The overall architecture is presented schematically in Figure 1.

Figure 1: Schematic representation of the proposed solution architecture. Given system parameters, the optimizer generates all possible compiler parameter combinations. The behavioral models are used to predict the PPA of each solution, before the optimizer removes solutions with insufficient frequency and ranks the remaining solutions according to various PPA criteria.

2.1 Memory Optimizer
To better understand the role of the behavioral models, we first introduce the memory optimizer, the interface through which chip designers access the behavioral models' predictions.

The optimizer aims to find a set of compiler input parameters which optimize PPA. An optimization run is characterized primarily by the set of fixed and free compiler input parameters. Fixed parameters are predetermined by the chip designer in accordance with system parameters, the latter of which can be viewed as external requirements on the memory. In contrast, free parameters are subject to optimization and consist mostly of architectural parameters, which control the internal architecture of the memory and significantly affect its PPA.

The fixed parameters port configuration, word width, and word depth must always be specified. The port configuration, which determines the number of read and write ports of a memory, qualifies the set of compilers available for optimization. Similarly, word depth and word width may limit the set of available compilers because compilers have a fixed size range. Fixed parameters usually comprise other predetermined system parameters. For example, a memory with a separate voltage for the periphery is only viable when the IC can provide separate power supplies, therefore chip designers typically fix the compiler's "dual rail" parameter in advance. However, system parameters may also remain undetermined, for example when comparing technology nodes during the architecture exploration phase. When left undetermined, system parameters are added to the set of free parameters.

Architectural parameters, on the other hand, always remain free and subject to optimization. Each free parameter can assume one of multiple discrete values. The set of possible values is determined by the choice of compiler, as well as the specified word depth and word width. For example, a compiler may allow only up to two banks for a memory with small word depth, while requiring at least four banks for deeper memories.

The set of all possible parametrizations of a compiler is vast; clearly, it is not viable to execute memory compilers prophylactically to determine the PPA of the full compiler parametrization space. The search space of a single optimization run, on the other hand, consists of all legal compiler parametrizations, or solutions, which match the set of fixed parameters, most importantly word depth and word width. The fixed parameters therefore constrain the search space to a hyperplane of the total parametrization space.

Another perspective is that the number and range of free parameters define the number of possible solutions. Each free parameter adds a dimension to the combinatorial search space of the respective compiler. For example, the search space of a compiler with four free parameters where each takes on one of three possible values consists of 3^4 = 81 solutions. Given four available compilers, the total number of solutions is then 4 × 81 = 324. More free parameters and more available value combinations can easily increase this number to multiple thousands.
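To make the combinatorics concrete, the following sketch enumerates the search space of one hypothetical compiler with itertools; the parameter names and value sets are illustrative and not taken from any actual compiler.

```python
from itertools import product

# Hypothetical free parameters of one compiler (names and values illustrative).
free_params = {
    "banks": [1, 2, 4],
    "column_mux": [4, 8, 16],
    "periphery_vt": ["low", "standard", "high"],
    "redundancy": ["none", "row", "row+io"],
}

# Each solution is one combination of free-parameter values.
solutions = [dict(zip(free_params, values))
             for values in product(*free_params.values())]

print(len(solutions))  # 3 * 3 * 3 * 3 = 81 solutions for this compiler
# With four such compilers available, the search space grows to 4 * 81 = 324.
```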
Even for a single memory optimization task with fixed word depth and word width, long compiler run times of at least 30 minutes make exhaustive execution infeasible. Assuming a compiler run time of 30 minutes and parallel execution of 20 compiler processes, the time to evaluate n possible solutions is given by n/40 hours. For the above example of 324 possible parametrizations, an exhaustive search directly using memory compilers would take over 8 hours, while the evaluation of 1,000 solutions would take 25 hours. Even in the unrealistic case of n parallel compiler executions, exhaustive evaluation time is lower bounded by the compiler run time. To make matters worse, these numbers concern the optimization of only a single memory, one of possibly thousands of memories in an IC.

Numerical optimization approaches, both statistical and deterministic, are based on function evaluation in a loop. However, any such method's convergence time is lower bounded by the function evaluation time. Besides designing optimization methods that minimize the required number of function evaluations, methods have been proposed to replace the direct, usually extremely expensive function evaluation by models. By using neural networks (or another behavioral model with low inference time) to predict the PPA outputs of memory compilers, the evaluation of a possible solution becomes significantly faster and computationally cheaper. In fact, neural network inference is so fast that, in practice, exhaustive evaluation of the search space can be done in less than 5 seconds for most cases, as more than 150 solutions can be evaluated per second.

For a better understanding of the optimization context, it is important to consider that compiler PPA outputs are the result of characterization of the memory array for different process variations (e.g., typical, fast, slow, referred to as process corners) and under different operating conditions (voltage and temperature) (Weste and Harris, 2015). A combination of process corner and environmental conditions is referred to as a design corner. As such, five of the six PPA dimensions, namely dynamic read power, dynamic write power, leakage, access time, and cycle time, but not area, are represented not as a single value per memory, but as a set of design corner-specific PPA variables.

The optimizer results display as pairs of compiler parametrizations and the respective PPA outputs as predicted by the behavioral model. These pairs are organized in four separate lists, each ranked according to a different PPA dimension. The ranking criteria of the lists are dynamic power, leakage, area, and a weighted sum of the former. All evaluated pairs appear in all four lists, albeit in potentially different ranking positions. The frequency dimension is not used for result ranking; instead, it is used as a threshold criterion to remove solutions which do not achieve product frequency requirements. A frequency exceeding requirements does not impact the ranking of results in any way. As all PPA dimensions except area consist of multiple PPA variables (one for each design corner), a single PPA variable must be determined per PPA dimension which is used for ranking (or, in the case of cycle time, filtering) of results.
The chip designer must therefore select one design corner per PPA dimension to determine the relevant PPA variable.

In addition to the three rankings based directly on PPA dimensions, the weighted sum ranking linearly combines the dynamic power, leakage, and area dimensions of each result according to weighting factors defined at the IC product level. The specific PPA variables used for this computation depend on the design corners chosen by the chip designer. Because the variables combined in this weighted ranking are on different scales and display different variances, a linear combination in the original scale is not practical. For example, an area in the magnitude of thousands of square micrometers would otherwise have a much larger effect on the ranking than dynamic power, which varies on a scale around three orders of magnitude smaller. Therefore, before the weighted ranking can be computed, each variable must be brought to a common, comparable scale by means of standardization, i.e., subtracting the mean and dividing by the standard deviation. Mean and variance are estimated based on known PPA data as part of data preparation, which is discussed in Section 2.2.
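As a minimal sketch of this standardize-then-combine ranking, consider predicted PPA values for a handful of candidate solutions; the function, values, and weights below are illustrative. Note that in the production flow, mean and standard deviation are estimated from known PPA data rather than from the candidate set itself.

```python
import numpy as np

def weighted_ranking(dyn_power, leakage, area, weights):
    """Rank candidate solutions by a weighted sum of standardized
    PPA variables; returns candidate indices, best (smallest) first."""
    score = np.zeros_like(area, dtype=float)
    for values, w in zip((dyn_power, leakage, area), weights):
        z = (values - values.mean()) / values.std()  # z-score standardization
        score += w * z
    return np.argsort(score)

# Toy candidates: raw area dwarfs dynamic power by orders of magnitude,
# which standardization neutralizes before the weighted combination.
order = weighted_ranking(
    dyn_power=np.array([0.9, 1.2, 1.0]),      # e.g. uA/MHz
    leakage=np.array([3.0, 2.5, 2.8]),        # e.g. uA
    area=np.array([4200.0, 3900.0, 4050.0]),  # e.g. um^2
    weights=(1.0, 1.0, 1.0),
)
print(order)  # candidate indices, best first
```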
2.2 Behavioral Models

As discussed in the previous section, finding optimal memory compiler input parameters in an efficient manner requires a model with low inference time to replace expensive compiler evaluations. Such a model accepts the same input parameters as the compiler it is trained to substitute during optimization. To refer to the set of model input parameters in contrast to compiler input parameters, we use the customary term "explanatory variables". From the set of explanatory variables, a behavioral model infers compiler PPA outputs, the so-called target variables.

Note that although memory compilers produce many essential artifacts such as netlists and RTL codes, only the task of predicting PPA outputs is to be adopted by the behavioral model. Once chip designers have selected a compiler parametrization from the optimizer results, they execute the actual memory compiler. Subsequently, they verify that the behavioral model's PPA predictions are in line with the compiler PPA outputs, and the remaining compiler artifacts are used for the product design flow.

A single behavioral model with sufficient capacity could in principle be fitted for all compilers together. However, routinely updated compiler versions as well as an expanding range of supported technology nodes would require frequent training of such a central model, a process which is computationally expensive. Therefore, we maintain a separate model for each compiler, creating what is commonly referred to as a "model zoo" (Erickson et al., 2017). Models are added whenever new compilers or updated versions are released, so that the model zoo contains one model per compiler and per compiler version. Because models are frozen after training, this approach also ensures that predictions for a given compiler version are consistent over time.

The behavioral models discussed here are trained through a supervised learning approach (Friedman et al., 2001). In supervised learning, a model is fitted based on training data where each observation has known values for both explanatory variables (x) and target variables (y). Training data is obtained by executing the respective memory compiler for a given compiler parametrization. The memory compiler's PPA outputs are referred to as ground truth data, or y. In contrast, target variables inferred by the behavioral model are called predictions, or ŷ. Because the target variables are real-valued rather than discrete, the problem can be characterized as a regression task (Friedman et al., 2001).

To collect ground truth data from the memory compiler, samples must first be obtained from the compiler parameter space. Exhaustive sampling of this vast combinatorial space is infeasible, mainly owing to the compiler input parameters word width and word depth, which can each take on hundreds of possible values. To obtain the set of explanatory data (i.e., concrete values for the explanatory variables), we randomly select 500 combinations of architectural and system parameters by sequentially fixing the value of each input parameter with uniform random probability among the valid choices.

Once compiler results are available, the respective behavioral model is trained and evaluated. Based on the model's prediction error on test set data, which is not used for training, we assess whether more training data is required. If that is the case, generation of another parametrization batch of 500 observations is triggered. Prior to sending the parametrization batch to the compiler, we remove those memories which are within size ranges with sufficiently low prediction errors, based on a separate prediction error analysis for different memory sizes (see Section 3.2). This results in decreasing parametrization batch sizes towards the end of the iterative training process, consequently speeding up the data generation process.

Because the amount of training data generated is driven by the quality of the model's predictions, the size of the dataset differs between behavioral models. In our current model zoo of 25 compilers across 3 technology nodes, the median total number of observations is 2,500, with some models needing as few as 500 observations and others requiring up to 6,000 observations to reach satisfactory prediction errors. For training data generation, the memory compilers are executed in parallel threads. Different compilers have different run times and parallelization resources; typically, 20 compilers are run in parallel for 30 to 60 minutes per parametrization. The generation of a single parametrization batch of 500 observations therefore takes approximately 12 to 24 hours.
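The sampling procedure itself is straightforward; the following sketch draws one batch of 500 random parametrizations from hypothetical parameter domains. Real compilers additionally constrain some architectural choices based on word depth and width, which is omitted here.

```python
import random

def sample_parametrization(param_domains, rng):
    """Fix each compiler input parameter uniformly at random
    among its valid choices."""
    return {name: rng.choice(choices)
            for name, choices in param_domains.items()}

# Hypothetical domains; real compilers expose their own valid ranges.
domains = {
    "word_depth": list(range(32, 8193, 32)),
    "word_width": list(range(4, 289, 4)),
    "banks": [1, 2, 4, 8],
    "periphery_vt": ["low", "standard", "high"],
}

rng = random.Random(42)
batch = [sample_parametrization(domains, rng) for _ in range(500)]
print(batch[0])
```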
The nature of the collected data is summarized in Table 1 and Table 2 by the example of a prototypical compiler. For all observed compilers, the explanatory variables (or compiler inputs) are discrete, ordinal values. With the exceptions of word depth and word width, these two to ten variables can take on between two and four possible values.

Table 1: Input parameters of a prototypical memory compiler.

    Parameter      # Values   Range / Values
    Bank           4          [1, …]
    Periphery Vt   …          {low < standard < high}
    Redundancy     3          {None < Row < Row + IO}
    Word depth     223        [32, …]

Table 2: Output (PPA) variables of a prototypical memory compiler.

    Parameter                Range (approx.)   Unit
    Area                     (0, …]            µm²
    Access time              (0, …]            ns
    Cycle time               (0, …]            ns
    Dynamic power (read)     (0, …]            µA/MHz
    Dynamic power (write)    (0, …]            µA/MHz
    Leakage                  (0, …]            µA
On the other hand, PPA values are real-valued. We observe up to 20 different design corners. In total, the number of PPA variables is given by the formula c × 5 + 1, where c is the number of design corners, 5 is the number of variables which are measured per design corner (read power, write power, leakage, cycle time, and access time), and 1 represents the variable "area", which is measured independently of design corners. For the average case of 15 design corners, the formula indicates 76 target variables.

For this work, we model compiler PPA outputs by means of fully connected feed-forward neural networks. Although the use of other regression models is conceivable, we propose feed-forward neural networks for several reasons. Firstly, their flexible structure allows their application to arbitrarily complex problems. Secondly, their ability to capture highly non-linear relationships is well suited for modelling PPA from compiler inputs. Lastly, neural networks allow analytical computation of gradients of the target variables with respect to the explanatory variables, a property we deem highly desirable for future work on yet more challenging optimization tasks (see Section 4.2).

We proceed to discuss data preparation, which is a significant factor contributing to the successful application of any machine learning technique (Kotsiantis et al., 2006). As input to neural networks, non-numerical input variables must be encoded. All explanatory variables are ordinal, and as such can be encoded as integer values. We further add the hand-crafted explanatory variable "size" to the set of explanatory variables, which is the product of word width and word depth, following the assumption that it is a better predictor of the target dimensions area, leakage, and dynamic power than either word depth or word width alone. Target variables, on the other hand, are transformed by applying the square root. This transformation aims to give more weight to observations with smaller targets, where deviations of the same absolute magnitude have a relatively larger effect. Lastly, to align variances, explanatory variables as well as target variables are standardized to a mean of zero and a standard deviation of one (also referred to as z-score normalization). Subsequently, min-max normalization is applied to ensure that explanatory and target variables of all observations are within a range of [−1, 1] (see also Kotsiantis et al., 2006). The mean, standard deviation, and extrema used for standardization and normalization are estimated based on the training set prior to training. Before inference, those same scaling factors are used for rescaling explanatory variables to the transformed scale, and for transforming predictions back to the original scale thereafter.
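The following is a compact sketch of this preparation pipeline on toy data; the column layout, values, and two-target setup are illustrative.

```python
import numpy as np

def fit_scaling(v):
    """Estimate z-score factors and, in standardized space, min-max extrema."""
    mean, std = v.mean(axis=0), v.std(axis=0)
    z = (v - mean) / std
    return mean, std, z.min(axis=0), z.max(axis=0)

def transform(v, mean, std, lo, hi):
    """Standardize, then min-max normalize into [-1, 1]."""
    z = (v - mean) / std
    return 2 * (z - lo) / (hi - lo) - 1

# Toy data: ordinal-encoded compiler inputs (columns: banks, redundancy,
# word width, word depth) and two PPA targets (e.g. area, leakage).
x = np.array([[1, 3, 64, 32], [2, 1, 128, 64], [4, 2, 256, 16]], dtype=float)
y = np.array([[4200.0, 0.9], [6100.0, 1.4], [9800.0, 2.2]])

# Hand-crafted "size" feature: word width * word depth.
x = np.column_stack([x, x[:, 2] * x[:, 3]])

y_sqrt = np.sqrt(y)  # the square root gives smaller targets more weight

x_scale, y_scale = fit_scaling(x), fit_scaling(y_sqrt)
x_t, y_t = transform(x, *x_scale), transform(y_sqrt, *y_scale)

def predictions_to_original_scale(y_pred, mean, std, lo, hi):
    """Invert min-max, z-score, and square-root steps, in that order."""
    z = (y_pred + 1) / 2 * (hi - lo) + lo
    return (z * std + mean) ** 2

print(predictions_to_original_scale(y_t, *y_scale))  # recovers y
```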
Neural network-based models are highly sensitive to the selection of a suitable architecture and training procedure (Domhan et al., 2015). The reader is referred to Section 3.2 for experimental justification of the choices laid out in the following. On the network's input layer, the number of units is equal to the number of compiler input parameters of the respective memory compiler. On the output layer, one unit is present for each target variable. Between input and output layers, there are two hidden layers. The number of units on each hidden layer is set to be equal to eight times the number of input units, which allows the network's capacity to scale with the problem complexity as implied by the number of compiler input parameters. Input and hidden layers are activated using the sigmoid function, whereas no activation function is used for the output layer.

For training, the dataset is split randomly into approximately two thirds for training and one third hold-out data. The hold-out set is again split into two thirds for evaluation (i.e., the test set) and one third for validation. The validation set is used to determine when training should be stopped, a technique called "early stopping" (Caruana et al., 2001). Early stopping is applied by checking validation set performance every 200 epochs, where one epoch encompasses one full pass over the training set during which network weights and biases are updated. If ten sequential checks find no reduction of the prediction error (within a small tolerance), training is stopped. Updates of network weights and biases are determined using the Adam optimizer (Kingma and Ba, 2014) with a fixed learning rate, and the mean absolute error is used as the loss function. The update steps are computed based on so-called mini-batches (Kingma and Ba, 2014) of 100 observations, which are sampled from the training set with uniform random probability and without replacement. Note that these mini-batches are completely unrelated to the parametrization batches used for training data generation; instead, mini-batch is a standard term referring to sets of samples from the available training data used to compute a stochastic gradient during neural network training.
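A sketch of this architecture and training setup using Keras is shown below. The dataset is a random stand-in, the learning rate and tolerance are illustrative placeholders, and Keras' epoch-based patience only approximates the 200-epoch checking interval with ten patience steps.

```python
import numpy as np
from tensorflow import keras

def build_model(n_inputs, n_targets, multiplier=8):
    """Fully connected feed-forward network: two sigmoid hidden layers
    with multiplier * n_inputs units each and a linear output layer."""
    model = keras.Sequential([
        keras.Input(shape=(n_inputs,)),
        keras.layers.Dense(multiplier * n_inputs, activation="sigmoid"),
        keras.layers.Dense(multiplier * n_inputs, activation="sigmoid"),
        keras.layers.Dense(n_targets),  # no output activation
    ])
    model.compile(optimizer=keras.optimizers.Adam(learning_rate=1e-4),
                  loss="mean_absolute_error")
    return model

# Random stand-in for a preprocessed compiler dataset scaled to [-1, 1].
x = np.random.uniform(-1, 1, size=(1500, 10)).astype("float32")
y = np.random.uniform(-1, 1, size=(1500, 76)).astype("float32")

model = build_model(n_inputs=10, n_targets=76)
early_stop = keras.callbacks.EarlyStopping(
    monitor="val_loss", min_delta=1e-6, patience=2000,
    restore_best_weights=True)
# For brevity, Keras splits off the validation fraction here; the paper
# uses a dedicated validation subset of the hold-out data instead.
model.fit(x, y, batch_size=100, epochs=20000,
          validation_split=0.2, callbacks=[early_stop], verbose=0)
```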
3 Evaluation

To qualify a model for real-world predictions of unseen data, its reliability must be assessed in the context of its application. This section discusses the acceptance criteria defined for the behavioral models and the memory optimizer. We further verify our approach and model architecture by comparing prediction errors across different neural network architectures as well as other state-of-the-art machine learning algorithms.

3.1 Evaluation Criteria

A model's quality can be estimated without bias by evaluating its prediction error for unseen data (held out during training). Section 3.2 discusses the appropriate error metrics and presents exemplary results of model performance for one of the models trained for this study. The behavioral model chosen for the analyses presented in Section 3.2 is representative of our model zoo in terms of the number of explanatory and target variables; with 4,500 observations, its dataset size is above average.

Although the prediction error is an important aspect, it fails to capture the impact on actual decision-making based on the memory optimizer. Because the memory planning workflow includes a final collection of accurate PPA outputs from the memory compiler, inaccurate predictions are tolerable as long as they lead to the selection of the correct memory parametrization. Section 3.3 therefore discusses and estimates the reliability of the memory optimizer in the context of decision making.
3.2 Prediction Error

To summarize the quality of a model's predictions, we compare predictions for observations from the hold-out set to the respective ground truth values, i.e., compiler PPA outputs. However, directly using the absolute prediction error has two disadvantages. On the one hand, interpretation is difficult because the error magnitude depends on the scale of the target variable; comparing the quality of predictions between dimensions of different units, for example, is not possible. On the other hand, comparing the absolute prediction error between two observations may not be adequate. This is illustrated by two observations for which the absolute prediction error is the same, whereas the ground truth is different: the impact of a large prediction error (i.e., the numerator) is more significant when the true target (i.e., the denominator) is smaller. We therefore use a relative prediction error to scale the absolute difference by the magnitude of the true target value. A naive approach to this is the absolute percentage error (APE), which is computed as shown in Equation 1.

$$\mathrm{APE} = \left| \frac{\hat{y} - y}{y} \right| \times 100 \qquad (1)$$
While APE is popular because of its numerous benefits, such as scale independence, ease of interpretation, and scaling with the true target value, past work has also uncovered severe issues with the metric, one of which is a systematic bias towards models which under-predict (Tofallis, 2015). To avoid these drawbacks, Morley et al. (2018) introduced the symmetric signed percentage bias based on the log accuracy ratio (see also Tofallis, 2015). This metric does not favor over- or under-predictions, yet it is easily interpretable as a percentage. As we are interested in the magnitude of errors rather than their direction, we remove the sign by taking the absolute value. Henceforth we use the unsigned symmetric percentage bias as shown in Equation 2, referring to it as relative error or percentage error.

$$\text{Symmetric Percentage Bias} = \left( \exp\left( \left| \log \frac{\hat{y}}{y} \right| \right) - 1 \right) \times 100 \qquad (2)$$
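A direct implementation of Equation 2 is only a few lines; the values below are toy numbers chosen to show that the metric depends only on the prediction-to-truth ratio.

```python
import numpy as np

def relative_error(y_true, y_pred):
    """Unsigned symmetric percentage bias (Equation 2)."""
    return 100.0 * (np.exp(np.abs(np.log(y_pred / y_true))) - 1.0)

y_true = np.array([10.0, 10.0, 8.0])
y_pred = np.array([12.0, 8.0, 10.0])
errs = relative_error(y_true, y_pred)
print(errs)             # [20. 25. 25.]: a 1.25x ratio scores 25% either way
print(np.median(errs))  # the median summarizes the skewed error distribution
```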
Because the relative error is lower bounded by zero but unbounded upwards, its distribution is typically skewed, exhibiting a tail on the right-hand side. This feature makes the arithmetic mean, which is heavily influenced by outliers, inappropriate as a measure of average. To summarize the prediction error across observations, we therefore use the median to obtain a more representative measure of average. Note that usage of the arithmetic mean to average multiple medians is robust to the skew and more expressive, for example when aggregating across target variables.

To understand the reliability of each dimension's predictions, we break down the prediction error by target variable. As explained in Section 2.1, some PPA dimensions (leakage, access time, cycle time, and dynamic read and write power) are represented by many PPA variables for different design corners. We aggregate the model's target variables into PPA dimensions in order to obtain a less cluttered visualization. In contrast to the aggregation across observations, the arithmetic mean is used to average across variables. Figure 2 shows the result of this analysis for a single behavioral model. The same analysis is performed across all behavioral models, the result of which is shown in Figure 3. In order to give each model equal influence on the displayed summary statistics, we randomly sample the same number of observations from each model's test set before computing and aggregating all errors.

Figure 2: Relative error on the test set of one representative model, grouped by PPA dimension. Box edges represent the 25%- and 75%-quantiles, the distance between them is the interquartile range. Lines in the center of each box represent the median. Whiskers on each side of a box extend to include all observations which are within a distance of 1.5 times the interquartile range from the 25%- and 75%-quantiles, respectively. Outliers outside of this range are not shown.

Figure 3: Relative test set error across all 25 models grouped by PPA dimension, with equal weight attributed to each model.

Figure 3 shows that median errors of at most 3% are achieved across variables. Best-case predictions for each variable exhibit close to no deviation from ground truth. The 25%-quantile is consistently between 0% and 2% error, indicating a very low prediction error for a quarter of the observations in the test set. The 75%-quantile is below 3% for 3 out of 6 PPA dimensions, while it is just above 5% for area and leakage, for which the prediction error is highest.

We further analyze the relative prediction error in relation to memory size in bits. This is done to focus training data generation on size ranges with insufficient prediction quality, as discussed in Section 2.2. For the analysis, observations from the test set are first grouped into 10 bins according to their size in bits. The average per bin is then computed by calculating the median across observations, before aggregating across variables using the arithmetic mean.
Figure 4 reveals that the expected prediction error is distributed fairly evenly across bins, asserting that prediction quality is stable for arbitrarily sized instances, with average prediction errors consistently below 2.5%. As the shade of each bar indicates the number of observations in the respective bin, it is visible that most observations are small memories. When the number of bit cells is small, the impact of the memory periphery on a memory's PPA is relatively larger. As most compiler input parameters affect the periphery rather than the bit cell array, it is more difficult to accurately predict the PPA of small memories. To achieve a comparable prediction error, more training data is thus needed for small word widths and word depths.

To determine an optimal architecture for the neural network, we perform a grid search across neural network architecture options. We test different values for the following options: the number of hidden layers, the hidden layer unit multiplier, the output layer activation function, and the hidden layer activation function.
Figure 4: Average relative error evaluated on the test set, averaged for different memory size ranges. Bars represent size ranges, and their shade indicates the number of test set observations available in that size range.

For the number of hidden layers, we test five values, and for the hidden layer unit multiplier, which determines the number of units on each hidden layer by means of multiplication with the input layer dimension, six values are considered. We skip extremely large network architectures where the number of hidden layers is eight and the hidden layer unit multiplier exceeds six. For hidden layers, the activation functions sigmoid, tanh, and rectified linear (relu) are considered. Regarding the output layer, no activation and relu are tested. For the latter, the output range is corrected to be lower bounded by −1 instead of 0, so that the same data scaling factors, which are based on values in [−1, 1], can be reused. All other training settings remain as described in Section 2.2: the mini-batch size is set to 100, dropout is disabled, and early stopping is applied. In total, we evaluate 180 different neural network architectures.

Figure 5 shows one plot for each hidden layer activation function. Within each plot, there is one square for each evaluated architecture, labeled with the relative prediction error. This prediction error is computed by taking the median relative error per variable, before using the arithmetic mean to aggregate across variables. Additionally, this metric is averaged across the folds of three-fold cross-validation using the arithmetic mean. In three-fold cross-validation, the dataset is split three times, each time using a different portion of the data for training and the remainder for testing. The color scale of the charts ends at a prediction error of 10% in order to facilitate discernibility among lower values. Results obtained using relu as the output activation are omitted because they consistently exhibit prediction errors of more than 30%, with a single exception (2.9% prediction error at eight hidden layers and a hidden unit multiplier of one). All plots shown therefore use no activation function for the output layer.

Figure 5: Average, cross-validated relative prediction errors for different neural network architectures. Architectures evaluated in each plot share a hidden layer activation function, as labeled above the respective chart. Squares within each plot correspond to a specific number of hidden layers (x-axis) and a hidden layer unit multiplier (y-axis), which determines the number of units on each hidden layer through multiplication with the number of input units.

The most obvious trend visible in Figure 5 is that the sigmoid activation of hidden layers tends to yield the best results. The tanh activation achieves appreciable results which, however, cannot be sustained for larger network architectures. The relu activation for hidden layers, on the other hand, appears inadequate, with prediction errors consistently above 5%. The lowest prediction error found overall is 2%, which is achieved with a network architecture of four hidden layers, sigmoid activation for hidden layers, and four times as many hidden units as input units. Similar scores are obtained for many other configurations which use sigmoidal hidden layer activations, especially when at least two hidden layers are present and the hidden layer unit multiplier is at least two. The trend appears to decline when both of these neural network architecture parameters approach the upper end of the tested range. Based on preliminary results of this analysis, a neural network architecture with two hidden layers and a hidden layer unit multiplier of eight (with an expected prediction error of 2.3%) has been used for the other analyses in this work.
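The grid itself is a small Cartesian product; the option lists below are illustrative stand-ins (five hidden layer counts, six unit multipliers, three hidden and two output activations).

```python
from itertools import product

# Illustrative stand-ins for the tested architecture options.
hidden_layers = [1, 2, 4, 6, 8]
unit_multipliers = [1, 2, 4, 6, 8, 10]
hidden_activations = ["relu", "sigmoid", "tanh"]
output_activations = [None, "relu"]

grid = [dict(layers=l, multiplier=m, hidden=h, output=o)
        for l, m, h, o in product(hidden_layers, unit_multipliers,
                                  hidden_activations, output_activations)]
print(len(grid))  # 5 * 6 * 3 * 2 = 180 candidate architectures

# Each candidate is then scored by three-fold cross-validation: the median
# relative error per target variable is averaged across variables with the
# arithmetic mean, then averaged across folds (evaluation loop omitted).
```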
Aiming to understand which correlations the behavioral model has learnt from the data, we analyze the derivative of the neural network with respect to the explanatory variables. A positive derivative indicates a positive correlation between explanatory variable and target variable, and vice versa. The derivatives are computed and averaged across the combined training and test set using the arithmetic mean, which is also used to aggregate target variables into PPA dimensions. For each explanatory variable, the gradient values are normalized into a range of [−1, 1] in order to emphasize which PPA dimensions are most impacted by each explanatory variable.
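A sketch of this gradient analysis using TensorFlow's automatic differentiation is shown below; a small randomly initialized network and random data stand in for the trained behavioral model and the combined (scaled) training and test sets.

```python
import numpy as np
import tensorflow as tf

def mean_input_gradients(model, x):
    """Average d(output)/d(input) over all observations; the sign of each
    entry indicates the learnt direction of correlation between a
    compiler input and a PPA variable."""
    x = tf.convert_to_tensor(x, dtype=tf.float32)
    with tf.GradientTape() as tape:
        tape.watch(x)
        y = model(x)
    jac = tape.batch_jacobian(y, x)    # shape: (n, targets, inputs)
    return tf.reduce_mean(jac, axis=0).numpy()

# Stand-in network and data (5 inputs, 3 targets).
model = tf.keras.Sequential([
    tf.keras.Input(shape=(5,)),
    tf.keras.layers.Dense(16, activation="sigmoid"),
    tf.keras.layers.Dense(3),
])
x = np.random.uniform(-1, 1, size=(200, 5)).astype("float32")

grads = mean_input_gradients(model, x)  # (3 targets, 5 inputs)
grads /= np.abs(grads).max(axis=0)      # normalize per input variable
print(grads)
```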
The results of this analysis, visualized in Figure 6, show the predicted impact of compiler input parameters on PPA. For example, the model seems to have properly learnt that as the periphery voltage threshold increases, leakage decreases, but cycle time increases. A larger number of banks, on the other hand, leads to a larger area while decreasing access time. Another relationship learnt by the model is unsurprising: there is a positive correlation between size in bits and area.

Figure 6: Average gradient of neural network outputs (PPA variables) with respect to compiler input parameters. The values are normalized by input variable, so that each column shows which PPA variables are impacted most by the respective compiler option.

We further compare our approach of fitting compiler data using feed-forward neural networks to state-of-the-art regression techniques from statistics and machine learning. Specifically, we compare least squares linear regression, gradient boosting (Friedman, 2001), AdaBoost (Freund and Schapire, 1997), random forest regression (Breiman, 2001), and polynomial regression (linear regression with polynomially transformed explanatory variables). We evaluate the prediction error of different configurations of these models using cross-validation. Prior to applying cross-validation, we extract a portion of the data as a validation set used for early stopping of neural network training. For the neural network-based model, we use the same neural network architecture as for the other analyses in this work.

The results are presented in Figure 7, where linear regression as well as AdaBoost with 500 and 1,000 estimators are excluded for clarity, as their errors exceed 6%. The figure clearly shows that our proposed regression using neural networks is the best approach among the compared techniques. Seven out of twelve approaches (including the three not shown in the chart) are outperformed by more than one percent, while closer followers like gradient boosting with 500 estimators and 3rd-degree polynomial regression are outperformed by a small, yet significant margin. Statistical significance was determined using a two-sided, paired Student's t-test (Student, 1908) to test the null hypothesis that the true mean error of each method is the same as the true mean error of the proposed method. The null hypothesis is rejected for every evaluated method, where asterisks above each bar represent the significance level. The number behind a method's name indicates the number of estimators (regression trees) in the case of ensemble methods, or the polynomial degree in the case of polynomial regression. Gradient boosting regression was performed using least squares loss or least absolute deviation; use of the latter is indicated by "LAD" in the figure. For unspecified parameters, the default choices as provided by the scikit-learn framework (Pedregosa et al., 2011) at version 0.20.3 are used.

Figure 7: Comparison of relative prediction errors between various regression techniques and the proposed feed-forward neural networks. Each bar represents the mean across three cross-validated train-test cycles. Whiskers indicate the standard deviation. All methods are regression models, not the sometimes better-known classification techniques with the same name.
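A condensed sketch of such a comparison with scikit-learn on stand-in data is shown below, scoring each method with the relative error of Equation 2. Current scikit-learn API names are used; the paper used version 0.20.3, where the least absolute deviation loss was spelled differently.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=(600, 8))
w = rng.uniform(0.5, 2.0, size=8)
y = 1.0 + np.sqrt(np.abs(x @ w))  # stand-in for a single PPA target

candidates = {
    "gradient boosting (500)": GradientBoostingRegressor(n_estimators=500),
    "gradient boosting (LAD)": GradientBoostingRegressor(loss="absolute_error"),
    "random forest (500)": RandomForestRegressor(n_estimators=500),
    "polynomial regression (3)": make_pipeline(PolynomialFeatures(3),
                                               LinearRegression()),
}

for name, estimator in candidates.items():
    fold_errors = []
    for train, test in KFold(3, shuffle=True, random_state=0).split(x):
        estimator.fit(x[train], y[train])
        pred = estimator.predict(x[test])
        # unsigned symmetric percentage bias, as in Equation 2
        rel = 100 * (np.exp(np.abs(np.log(pred / y[test]))) - 1)
        fold_errors.append(np.median(rel))
    print(f"{name}: {np.mean(fold_errors):.2f}%")
```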
The low inference time of neural networks is among the key arguments for their application to the memory selection problem. Table 3 shows the inference time of a prototypical model from our model zoo, measured for different numbers of samples predicted. This is the pure prediction time, excluding the overhead of several set-up tasks, such as loading the model from disk into memory and preprocessing the data. Timing analysis was performed on a machine with a single virtual CPU and 2 Gigabytes of memory. Inference was repeated 1,000 times, retaining the minimum run time in order to approximate the lower bound of the required computing time. As the timing results demonstrate, inference takes consistently less than a second for up to 10,000 observations. Moreover, even inference of 100,000 observations does not take significantly longer than a second. The scaling factor shows how much longer the inference is for each sample size compared to the time taken to predict a single observation. This illustrates that predicting large numbers of samples at once is much more efficient than individual prediction, a property which enables the efficient search space evaluation upon which the optimization framework relies.

Training, which is performed on the same hardware as inference, takes approximately between 5 and 30 minutes. However, training time depends not only on the number of observations, explanatory variables, and target variables in the dataset, but also on the number of epochs needed until early stopping determines convergence. For example, training with a typical dataset of 2,500 observations for 500 epochs takes 40 seconds. Training is normally stopped after 5,000 to 20,000 epochs. Trained models are stored on disk and use less than 200 Kilobytes, but can easily be compressed to less than half that size.

Table 3: Inference times of a prototypical behavioral model for different numbers of predicted samples, with the scaling factor relative to predicting a single observation.
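Timing measurements like those in Table 3 can be reproduced with a sketch along the following lines, which assumes a trained `model` and scaled input data `x` as in the earlier sketches.

```python
import time

def min_inference_time(model, x, repeats=1000):
    """Approximate the lower bound of pure batch prediction time by
    keeping the minimum over many repetitions (set-up excluded)."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        model(x)  # prediction only; no disk loading or preprocessing
        best = min(best, time.perf_counter() - start)
    return best

# Relate batch inference time to single-observation time (the scaling
# factor reported in Table 3):
# t1 = min_inference_time(model, x[:1])
# for n in (10, 100, 1000, 10000):
#     tn = min_inference_time(model, x[:n])
#     print(n, tn, tn / t1)
```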
3.3 Decision Reliability

Prediction error evaluation alone does not provide a comprehensive view of the quality of optimizer results. The goal of this analysis is to devise a metric capable of estimating decision reliability in the context of the memory optimizer. For this purpose, assume that a chip designer will select the memory parametrization ranked first according to their selected PPA criterion (see Section 2.1). Under this assumption, it is intuitive that the correct decision will be made if the truly best-suited memory is ranked first.

Assume that the compiler parametrizations ranked first (x₁) and second (x₂) in the optimizer results are separated by a distance of d = ŷ₂ − ŷ₁ in terms of predicted PPA. As illustrated in Figure 8, it follows that the two instances will be ranked in the wrong order (relative to each other) if the sum of the prediction errors ε₁ + ε₂ of the results exceeds d. In other words, when the over-estimation of the first-ranked result and the under-estimation of the second-ranked result exceed the distance between the two, a wrong decision will be made. The same applies to all pairs (x₁, xᵢ) in a result's ranking.

Figure 8: Illustration of the computational basis for the decision reliability analysis. When the error distributions ε centered around the ranked predictions ŷ overlap, the reliability of the ranking, and hence of the ranking-based decision, is impaired. The decision reliability score of a given ranking is computed by averaging the estimated size of the wrong decision regions of every pair of results involving the first-ranked prediction.

Estimating decision reliability in the context of the memory optimizer poses some challenges. Firstly, evaluations can no longer be made on a per-compiler basis, as multiple memory compilers make up the set of results of a single optimizer run. Consequently, all compiler models which apply to the selected port configuration have to be jointly evaluated.
We interpret this proportion as a measure of decision reliability in the givenoptimizer run, where 100% is the best attainable value and 0% is the lower bound.Expected error distribution of each result is estimated based on the prediction errorsof at least 100 similar samples from the test set. Similar samples are chosen basedon proximity in terms of size in bits. When available, memories of the same size areused for error estimation, otherwise we select 50 neighbors with a larger size and 50neighbors with a smaller size. The Shapiro–Wilk test (Shapiro and Wilk, 1965) isconducted to test the null hypothesis of normality of errors. For distributions wherethe null hypothesis is rejected (based on a p-value of 0 . PPA Dimension Mean 95%-Quantile Minimum area 100% 100% 100%leakage 99 .
Table 4: Decision reliability per PPA ranking dimension (mean, 95%-quantile, and minimum across evaluated optimizer runs).

    PPA Dimension   Mean     95%-Quantile   Minimum
    area            100%     100%           100%
    leakage         99.9%    100%           16.…%
    …               ….2%     100%           47.…%
    …               ….4%     100%           41.…%

3.4 Comparison with Expert-Based Selection

In order to assess the real-world benefit of the proposed memory optimization framework, we compare an existing memory selection performed by human experts to a selection made based on the optimizer. The existing expert-based selection consists of 5,623 memories of various bit sizes, target frequencies, and other fixed system parameters. We apply the optimizer to minimize area, while for each memory the same target frequency which constrained the experts' selection must be met. For each memory, we select the physical instance ranked first according to area. Subsequently, we compare both methods in terms of PPA dimensions, based on the same design corners as the expert-based selection criterion. The PPA of each selection method is assessed by using compiler outputs where available, or network predictions otherwise, summing the individual memories' PPA to calculate the PPA of the whole selection. The differences in PPA are reported as a percentage of the PPA of the expert-based selection. Less than four hours are required to complete the optimization of all memories, while optimization by human experts took approximately 10 work weeks.

Table 5: PPA difference of optimizer-based vs. expert-based memory selection, relative to the expert-based selection.
    PPA Dimension    Difference
    area             −…%
    leakage          −…%
    dynamic power    −…%

4 Discussion and Outlook

In this section, we discuss the results presented in Section 3 and their implications for IC design. We further explore future challenges and research opportunities related to memory compiler PPA optimization.
4.1 Discussion

The evaluation of the optimizer shown in Section 3.3 illustrates the effectiveness of our proposed solution at finding the best possible compiler input parameters for given design requirements, with a decision reliability of over 99% on average. This achievement is owed to models with high prediction quality, as revealed by average prediction errors below 2.5% in light of a complex, high-dimensional relationship between compiler input parameters and PPA outputs.

Meanwhile, the optimization is a fully automatic process with remarkably low run times, averaging less than ten seconds. New compiler versions are further supported within days of their release, with quality assurance provided by the automation of the data generation, model training, and model evaluation cycles.

As the comparison with expert-based memory selection for a real-world chip demonstrates, the proposed solution also attains sizable gains over careful manual optimization, yielding more than 10% reduction in terms of area, leakage, and dynamic power. It is important to note that the memory optimizer is already in full productive use for multiple large-volume design projects and not merely a proof of concept. Successful completion of real-world, commercial chip design projects which relied on the memory optimizer further manifests the value added by our approach. Through the use of the memory optimizer, the complexity of the design process was reduced, resulting in estimated time savings of 20% for the selection and parametrization process of IC products.
4.2 Future Work

While the difficult task of optimizing compiler input parameters for a given memory has been solved to our satisfaction, many challenges in the space of memory compiler parameter selection remain. One major topic is the optimal tuning of many memories in an ensemble, which could enable valuable use cases such as system optimization or the optimization of compounds of multiple physical macros. In either case, optimizing the system of memories as a whole rather than individually enables unique solutions with PPA trade-offs between memories rather than individually balanced instances, leading to improved overall PPA. The ensemble problem is characterized by a combination of possibly thousands of memories, resulting in a much larger search space. Even given sub-millisecond model inference times, minimization techniques beyond exhaustive search are required. We envision that neural network properties, such as analytical gradient computation, can be exploited to efficiently guide the search space exploration. Applied to compound memories, such a solution would fill the optimization gap for memory design edge cases such as extremely high-frequency or large-size macros. On the other hand, rapid optima search across all memories of an entire IC would revolutionize the design flow as a whole, making early product PPA estimation as well as final memory optimization automated, fast, and accurate.
References
Breiman, L. (2001). Random forests. Machine Learning, 45(1):5–32.

Caruana, R., Lawrence, S., and Giles, C. L. (2001). Overfitting in neural nets: Backpropagation, conjugate gradient, and early stopping. In Advances in Neural Information Processing Systems, pages 402–408.

Domhan, T., Springenberg, J. T., and Hutter, F. (2015). Speeding up automatic hyperparameter optimization of deep neural networks by extrapolation of learning curves. In Twenty-Fourth International Joint Conference on Artificial Intelligence.

Erickson, B. J., Korfiatis, P., Akkus, Z., Kline, T., and Philbrick, K. (2017). Toolkits and libraries for deep learning. Journal of Digital Imaging, 30(4):400–405.

Freund, Y. and Schapire, R. E. (1997). A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences, 55(1):119–139.

Friedman, J., Hastie, T., and Tibshirani, R. (2001). The Elements of Statistical Learning, volume 1. Springer Series in Statistics, New York.

Friedman, J. H. (2001). Greedy function approximation: A gradient boosting machine. Annals of Statistics, pages 1189–1232.

Guthaus, M. R., Stine, J. E., Ataei, S., Chen, B., Wu, B., and Sarwar, M. (2016). OpenRAM: An open-source memory compiler. In Proceedings of the 35th International Conference on Computer-Aided Design, ICCAD '16, pages 93:1–93:6, New York, NY, USA. ACM.

Kingma, D. P. and Ba, J. (2014). Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.

Kotsiantis, S. B., Kanellopoulos, D., and Pintelas, P. E. (2006). Data preprocessing for supervised learning. International Journal of Computer Science, 1(2):111–117.

Morley, S. K., Brito, T. V., and Welling, D. T. (2018). Measures of model performance based on the log accuracy ratio. Space Weather, 16(1):69–88.

Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher, M., Perrot, M., and Duchesnay, E. (2011). Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830.

Scott, D. W. (1992). Multivariate Density Estimation: Theory, Practice, and Visualization. John Wiley & Sons.

Shapiro, S. S. and Wilk, M. B. (1965). An analysis of variance test for normality (complete samples). Biometrika, 52(3/4):591–611.

Student (1908). The probable error of a mean. Biometrika, pages 1–25.

Tofallis, C. (2015). A better measure of relative prediction accuracy for model selection and model estimation. Journal of the Operational Research Society, 66(8):1352–1362.

Wang, M., Lv, W., Yang, F., Yan, C., Cai, W., Zhou, D., and Zeng, X. (2018). Efficient yield optimization for analog and SRAM circuits via Gaussian process regression and adaptive yield estimation. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 37(10):1929–1942.

Weste, N. H. and Harris, D. (2015). CMOS VLSI Design: A Circuits and Systems Perspective. Pearson Education India.

Yao, J., Ye, Z., and Wang, Y. (2015). An efficient SRAM yield analysis and optimization method with adaptive online surrogate modeling. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems.