A Survey of Machine Learning Applied to Computer Architecture Design
Drew D. Penney and Lizhong Chen*, Senior Member, IEEE
* Corresponding author. Email: [email protected]. The authors are with Oregon State University, Corvallis, OR 97331. Copyright 2019 by Drew D. Penney and Lizhong Chen. All Rights Reserved.
Abstract—Machine learning has enabled significant benefits in diverse fields, but, with a few exceptions, has had limited impact on computer architecture. Recent work, however, has explored broader applicability for design, optimization, and simulation. Notably, machine learning based strategies often surpass prior state-of-the-art analytical, heuristic, and human-expert approaches. This paper reviews machine learning applied system-wide to simulation and run-time optimization, and in many individual components, including memory systems, branch predictors, networks-on-chip, and GPUs. The paper further analyzes current practice to highlight useful design strategies and identify areas for future work, based on optimized implementation strategies, opportune extensions to existing work, and ambitious long-term possibilities. Taken together, these strategies and techniques present a promising future for increasingly automated architectural design.
1 INTRODUCTION
In the past decade, machine learning (ML) has rapidly become a revolutionary factor in many fields, ranging from commercial applications, as in self-driving cars, to medical applications, improving disease screening and diagnosis. In each of these applications, an ML model is trained to make predictions or decisions without explicit programming by discovering embedded patterns or relationships in the data. Notably, ML models can perform well in tasks/applications where relationships are too complex to model using analytical methods. These powerful learning capabilities continue to enable rapid developments in diverse fields. Concurrently, the exponential growth predicted by Moore’s law has slowed, putting increasing burden on architects to supplant Moore’s law with architectural advances. These opposing trends suggest opportunities for a paradigm shift in which computer architecture enables ML and, simultaneously, ML improves computer architecture, closing a positive-feedback loop with vast potential for both fields.

Traditionally, the relationship between computer architecture and ML has been relatively imbalanced, focusing on architectural optimizations to accelerate ML algorithms. In fact, the recent resurgence in AI research is, at least partly, attributed to improved processing capabilities. These improvements are enhanced by hardware optimizations exploiting available parallelism, data reuse, sparsity, etc. in existing ML algorithms. In contrast, there has been relatively limited work applying ML to improve architectural design, with branch prediction being one of a few mainstream examples. This nascent work, although limited, presents an auspicious approach for architectural design.

This paper presents an overview of ML applied to architectural design and analysis. As illustrated in Figure 1, this field has grown significantly in both success and popularity, particularly in the past few years. These works establish the broad applicability and future potential of ML-enabled architectural design; existing ML-based approaches, ranging from DVFS with simple classification trees to design space exploration via deep reinforcement learning, have already surpassed their respective state-of-the-art human-expert and heuristic based designs. ML-based design will likely continue to provide breakthroughs as promising applications are explored.
Fig. 1. Publications on machine learning applied to architecture (for works examined in Section 3)

The paper is organized as follows. Section 2 provides background on ML and existing models to build intuition on ML applicability to architectural issues. Section 3 presents existing work on ML applied to architecture. Section 4 then compares and contrasts implementation strategies in existing work to highlight significant design considerations. Section 5 identifies possible improvements and extensions to existing work as well as promising, new applications for future work. Finally, Section 6 concludes.
2 BACKGROUND
Machine learning has been rapidly adopted in many fields as an alternative approach for a diverse range of problems. This fundamental applicability stems from the powerful relationship learning capabilities of ML algorithms. Specifically, ML models leverage a generic framework in which these models learn from examples, rather than explicit programming, enabling application in many tasks, including those too difficult to represent using standard programming methods. Furthermore, using this generic framework, there may be many possible approaches for any given problem. For example, in the case of predicting instructions-per-cycle (IPC) for a processor, one can experiment with a simple linear regression model, which learns a linear relationship between features (such as core frequency and cache size) and the IPC. This approach may work well or it may work poorly. In the case it works poorly, one can try different features, non-linear feature combinations (such as core frequency times cache size), or a different model entirely, with another common choice being an artificial neural network (ANN). This diversity in possible approaches enables adjustment of models, model parameters, and training features to match the task at hand.

The learning approach and the model are both fundamental considerations in applying machine learning to any problem. In general, there are four main categories of learning approaches: supervised learning, unsupervised learning, semi-supervised learning, and reinforcement learning. These approaches can be differentiated by what data is used and how that data is used to facilitate learning. Similarly, many appropriate models may exist for a given problem, thus enabling significant diversity in application based on the learning approach, hardware resources, available data, etc. In the following, we introduce these learning approaches and several significant models for each learning approach, focusing on approaches with proven applicability. Implementation details are considered later in Section 4.
Supervised learning: In supervised learning, the model is trained using input features and output targets, with the result being a model that can predict the output for new, unseen inputs. Common supervised learning applications include regression (predicting a value such as processor IPC) and classification (predicting a label such as the optimal core configuration for application execution). Feature selection, discussed in Section 2.3, is particularly important in these applications as the model must learn to predict solely based on feature values.

Supervised learning models can be generalized into four categories: decision trees, Bayesian networks, support vector machines (SVMs), and artificial neural networks [1]. Decision trees use a tree structure where each node represents a feature and branches represent a value (or range of values) for that feature. Inputs are therefore classified by sequentially following branches based on the value of the feature being considered at a given node. Bayesian networks instead embed conditional relationships into a graphical structure; nodes represent random variables and edges represent conditional dependence between these variables. A performance prediction model, for example, can condition prediction for new applications on learned distributions for unobserved variables (i.e., underlying factors affecting performance) from other applications, as in [2]. SVMs are generally known for their function rather than a particular graphical structure (as in decision trees and Bayesian networks). Specifically, SVMs learn the best dividing line (in 2-D) or hyperplane (in high dimensions) between examples, then use examples along this hyperplane to make new predictions. SVMs can also be extended to non-linear problems using kernel methods [3] as well as multi-class problems. Finally, artificial neural networks (or simply neural networks) represent a broad category of models that are, again, defined by their structure, which is reminiscent of neurons in the human brain; layers of nodes/neurons are connected using links with learned weights, enabling particular nodes to respond to specific input features. Simple perceptron models contain just one weight layer, directly converting the weighted sum of inputs into an output. More complex deep neural networks (DNNs) include several (or many) layers of these weighted sums. Additional variants such as convolutional neural networks (CNNs) incorporate convolution operations between some layers to capture spatial locality while recurrent neural networks re-use the previous output to learn sequences and long-term patterns. All these supervised learning models can be used in both classification and regression tasks, although there are some distinct high-level differences. Variants of SVMs and neural networks tend to perform better for high-dimension and continuous features and also when features may be nonlinear [1]. These models, however, tend to require more data compared to Bayesian networks and decision trees.
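To make the supervised learning workflow concrete, the sketch below fits a linear-regression IPC predictor and a small ANN alternative; the feature names and training samples (core frequency, cache size) are illustrative assumptions rather than data from any surveyed work.

```python
# Minimal sketch of supervised learning for IPC prediction (illustrative data).
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.neural_network import MLPRegressor

# Hypothetical training set: [core frequency (GHz), cache size (MB)] -> measured IPC.
X = np.array([[1.0, 1], [1.0, 2], [2.0, 1], [2.0, 2], [3.0, 4], [3.5, 8]])
y = np.array([0.8, 0.9, 1.1, 1.3, 1.6, 1.8])

# Linear model: assumes IPC is roughly a weighted sum of the features.
linear = LinearRegression().fit(X, y)

# If the linear fit is poor, a small ANN can capture non-linear interactions.
ann = MLPRegressor(hidden_layer_sizes=(16,), max_iter=5000, random_state=0).fit(X, y)

new_config = np.array([[2.5, 4]])  # unseen configuration
print(linear.predict(new_config), ann.predict(new_config))
```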
Unsupervised learning: Unsupervised learning uses just input data to extract information without human effort. These models can therefore be useful, for example, in reducing data dimensionality by finding appropriate alternative representations or clustering data into classes that may not be obvious to humans.

Thus far, the two primary unsupervised learning models applied to architecture are principal component analysis (PCA) and k-means clustering. PCA provides a method to extract significant information from a dataset by determining linear feature combinations with high variance [4]. As such, PCA can be applied as an initial step towards building a model with reduced dimensionality, a highly desirable feature in most applications, albeit at the cost of interpretability (discussed in Section 4). K-means clustering is instead used to identify groups of data with similar features. These groups may be further processed to generalize behavior or simplify representations for large datasets.
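A minimal sketch of how these two models are typically combined (reducing workload feature vectors before grouping them); the feature matrix, component threshold, and cluster count below are placeholder assumptions, not a specific surveyed configuration.

```python
# Minimal sketch: PCA for dimensionality reduction followed by k-means clustering.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

# Placeholder data: 200 workloads, each described by 32 profiling features.
features = np.random.rand(200, 32)

# Keep the linear feature combinations that explain 95% of the variance.
reduced = PCA(n_components=0.95).fit_transform(features)

# Group workloads with similar (reduced) feature vectors.
labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(reduced)
print(reduced.shape, np.bincount(labels))
```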
Semi-supervised learning: Semi-supervised learning represents a mix of supervised and unsupervised methods, with some paired input/output data, and some unpaired input data. Using this approach, learning can take advantage of limited labeled data and potentially significant unlabeled data. We note that this approach has, thus far, not yet found application in architecture. Nevertheless, one work on circuits analysis [5] presents a possible strategy that could be adapted in future work.
Reinforcement learning: In reinforcement learning, an agent is sequentially provided with input based on an environment state and learns to perform actions that optimize a reward. For example, in the context of memory controllers, the agent replaces traditional control logic. Input could include pending reads and writes while actions could include standard memory controller commands (row read, write, pre-charge, etc.). Throughput could then be optimized by including it in the reward function. Given this setup, the agent will potentially, over time, learn to choose control actions that maximize throughput.

Reinforcement learning models applied to architecture, as a whole, can be understood using a representation based on states, actions, and rewards. The agent attempts to learn a policy function $\pi$, which defines the action $a$ to take at a given state $s$, based on a received reward $r$ [6]. A learned state-value function, following the policy, is then given as

$$V^{\pi}(s) = \mathbb{E}\Big[\sum_{t \ge 0} \gamma^{t} r_{t} \,\Big|\, s_{0} = s, \pi\Big] \qquad (1)$$

where $\gamma$ is a discount factor ($\gamma \le 1$), which dictates how much the model should consider future rewards. The cumulative rewards are then maximized by learning an optimal policy $\pi^{*}$ that satisfies

$$\pi^{*}(s) = \arg\max_{\pi} \mathbb{E}\Big[\sum_{t \ge 0} \gamma^{t} r_{t} \,\Big|\, s_{0} = s, \pi\Big]. \qquad (2)$$

Various models may implement different approaches to learn this optimal policy, but largely address the same problem of maximizing rewards. Q-learning is a noteworthy example that models an action-value function by estimating the value of an individual action from a given state.
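For reference, the standard tabular Q-learning update (a textbook formulation, not taken from any one surveyed paper) estimates the action-value $Q(s, a)$ incrementally:

$$Q(s_{t}, a_{t}) \leftarrow Q(s_{t}, a_{t}) + \alpha\big[r_{t} + \gamma \max_{a'} Q(s_{t+1}, a') - Q(s_{t}, a_{t})\big]$$

where $\alpha$ is a learning rate; the memory controller and NoC schemes discussed in Section 3 instantiate this update with architecture-specific states, actions, and rewards.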
Supervised (and semi-supervised) learning methods rely upon input data features to model relationships and generate predictions. Consequently, approaches for feature selection can substantially impact model performance, including concerns such as over-fitting and computational overhead, as well as more abstract concerns, such as feature interpretability. In some works, feature selection is entirely based on expert knowledge. Additional, more general, approaches can either supplant or supplement expert knowledge.

One set of approaches, called filter methods, considers features individually using metrics involving statistical correlation or information theoretic criteria such as mutual information. These approaches are usually the least computationally intensive so may be preferred for very large feature sets, but model performance may be sub-optimal since evaluation criteria in filter methods do not consider feature context [7]; two features that provide little benefit individually may be beneficial together. Many alternative approaches therefore consider feature subsets.

Wrapper methods provide a black-box method for feature selection by directly assessing the performance of a learning model [7]. Commonly applied greedy approaches include forward selection and backward elimination. In forward selection, features are progressively added to the selected feature subset based on improvement to the overall learning model. Conversely, backward elimination progressively removes features that provide little benefit.

Embedded methods integrate feature selection into the learning model to provide a trade-off between filter and wrapper methods [8]. Regularization is a widely used embedded method that allows the learning model to be fit, while simultaneously forcing feature coefficients to be small (or zero). Features with zero coefficient values can then be removed. This method eliminates the iterative feature selection present in wrapper methods, which can have high computational requirements [7].
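As an illustration of the wrapper and embedded methods above, the sketch below contrasts greedy forward selection with L1-regularized (Lasso) coefficient pruning on placeholder data; neither example reproduces a specific surveyed study.

```python
# Minimal sketch: wrapper-style forward selection vs. embedded (L1/Lasso) selection.
import numpy as np
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LinearRegression, Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 20))                  # 20 candidate features (placeholder)
y = 2.0 * X[:, 3] - 1.5 * X[:, 7] + rng.normal(scale=0.1, size=300)

# Wrapper: greedily add the features that most improve a linear model.
forward = SequentialFeatureSelector(LinearRegression(), n_features_to_select=2,
                                    direction="forward").fit(X, y)
print("forward selection picked:", np.flatnonzero(forward.get_support()))

# Embedded: L1 regularization drives unhelpful coefficients to zero.
lasso = Lasso(alpha=0.05).fit(X, y)
print("lasso kept:", np.flatnonzero(np.abs(lasso.coef_) > 1e-6))
```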
3 LITERATURE REVIEW
This section reviews existing work that applies machine learning to architecture. Work is organized by sub-system (when applicable) or primary objective. We focus on design and optimization, but also introduce general performance prediction work.
Cycle-accurate simulators are commonly used in system performance prediction, but require several orders of magnitude more time than native execution. ML can offset this penalty through a trade-off between simulation time and accuracy. In general, ML can reduce execution time by 2-3 orders of magnitude with relatively high (although task dependent) accuracy. Early work by Ipek et al. [9] modeled architectural design spaces using an ANN ensemble (a group of ANN predictors). Models were trained on approximately 1% of the design space, then predicted CMP performance with 4-5% error for random points, albeit only in that specific configuration space. When combined with SimPoints, predictions exhibit slightly higher error, but the simulated instruction count is further reduced. Ozisikyilmaz et al. [10] additionally predicted SPEC performance for future systems that may be poorly modeled by existing simulators. Evaluation was limited to randomly-sampled data with relatively simple linear regression and neural network models, but nevertheless demonstrated advantages for pruned neural networks compared to standard single-layer models (as in [9]). Several other ML approaches have also been tested. Eyerman et al. [11] proposed a mechanistic-empirical model for processor CPI prediction. In this approach, they used a generic mechanistic model with parameters inferred by a regression model. Their model is limited to single-core performance prediction, but improves accuracy, ease of implementation (compared to purely mechanistic models), and interpretability (compared to purely empirical models). Zheng et al. [12], [13] explored cross-platform prediction from Intel/AMD to ARM processors using linear regression. Their first approach [12] made predictions based on a local neighborhood of examples around the target point to approximate non-linear behavior. They later [13] emphasized phase-level prediction, assuming that phase-level behavior would be approximately linear. Notably, average error for cycle count predictions is less than 1% using phase-level profiling. This approach is, however, restricted to a single target architecture and requires source code for phase-level analysis, leaving significant opportunities for future work. Finally, recent work by Agarwal et al. [14] introduced a method to predict parallel execution speedup using single-threaded execution characteristics. They trained separate models for each thread count using application-level performance counters. Although neural networks were omitted due to limited data, evaluation found that Gaussian process regression still provided promising results, particularly for high thread counts.

Design Space Exploration: GPU design space exploration has proven to be a particularly favorable application for ML due to a highly irregular design space; some kernels exhibit relatively linear scaling while others exhibit very complex relationships between configuration parameters, power, and performance [15], [16], [17]. Jia et al. [15] proposed Stargazer, a regression-based framework based on natural cubic splines. Stargazer randomly samples approximately 300 points from a target design space (933K points in evaluation) for each application, then applies stepwise regression on these points. Notably, the framework achieves under 3.8% average performance prediction error. Wu et al. [16] instead explicitly modeled scaling for compute units, core frequency, and memory frequency.
Scaling data from training kernels was processed using k-means clustering to group kernels by scaling behavior. An ANN then classifies kernels into these clusters, allowing new kernels to be classified and predictions made using cluster scaling factors. This approach, in contrast to Jia et al. [15], therefore requires just a few samples for new applications. Jooya et al. [17], similar to Jia et al. [15], considered a per-application performance/power prediction model, but additionally proposed a scheme to predict per-application Pareto fronts. Many ANN-based predictors were trained and the most accurate subset was used as an ensemble for prediction. Prediction accuracy was later improved by sampling points within a threshold of the previously predicted Pareto-optimal curve.
Lin et al. [18] combined a performance predicting DNN with a genetic search scheme to explore memory controller placement. The DNN was used as a surrogate fitness function, obviating slow system simulations. The resulting placement improves system performance by 19.3%.
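The common pattern across these design space exploration works (sample a small fraction of configurations, train a surrogate predictor, then rank the rest) can be sketched as follows; the design space, features, and "simulator" are placeholders rather than those of any cited paper.

```python
# Minimal sketch: surrogate-model design space exploration (placeholder design space).
import itertools
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Enumerate a toy design space: (cores, cache MB, frequency GHz).
space = np.array(list(itertools.product([2, 4, 8, 16], [1, 2, 4, 8], [1.0, 2.0, 3.0])))

def simulate(cfg):
    """Stand-in for a slow cycle-accurate simulation returning performance."""
    cores, cache, freq = cfg
    return np.log2(cores) * freq + 0.3 * np.log2(cache) + np.random.normal(scale=0.05)

# Simulate only a small sample of the space, then train a surrogate on it.
rng = np.random.default_rng(1)
sampled = rng.choice(len(space), size=12, replace=False)
surrogate = RandomForestRegressor(n_estimators=100, random_state=0)
surrogate.fit(space[sampled], [simulate(c) for c in space[sampled]])

# Rank all configurations by predicted performance instead of simulating each one.
predicted = surrogate.predict(space)
print("predicted best configuration:", space[np.argmax(predicted)])
```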
Cross-Platform Prediction: Porting applications for execution on GPUs is a challenging task with potentially uncertain benefits over CPU execution. Work has therefore examined methods to predict speedup or efficiency improvements using just CPU execution behavior. Baldini et al. [19] cast the problem as a classification task, training a modified nearest-neighbor and a support vector machine (SVM) model to determine, based on a threshold, whether GPU implementation would be beneficial. Using this approach, they predicted near-optimal configurations 91% of the time. In contrast, Ardalani et al. [20] trained a large ensemble of regression models to directly predict GPU performance for the code segment. Although several code segments exhibit high error, the geometric mean of the absolute value of the relative error is still just 11.6% and the model successfully identifies several code segments (both beneficial and non-beneficial) that are incorrectly predicted by human experts. Later work by Ardalani et al. [21] introduced a completely static-analysis-based framework using a random forest model for binary classification. This approach eliminates both dynamic profiling and human guidance, instead using features such as instruction mix, branch divergence estimates, and kernel size to provide 94% accuracy for binary speedup classification (using a speedup threshold of 3).
GPU Specific Prediction & Classification: O’Neal et al. [22] presented a methodology for next-generation GPU performance prediction as cycles-per-frame (CPF) for DirectX applications. They focused on Intel GPUs, profiling earlier-generation architectures (e.g., Haswell GT2) to train next-generation predictors. They found that different models (i.e., linear vs non-linear) can produce more accurate results depending on the prediction target (Broadwell GT2/GT3 vs Skylake GT3), with the best performing models achieving less than 10% CPF prediction error. Recent work by Li et al. [23] presented a re-evaluation of commonly accepted knowledge of GPU traffic patterns. They used a CNN and t-distributed stochastic neighbor embedding on heatmap-transformed traffic data, identifying eight unique patterns with 94% accuracy.
Scheduling: GPU processing-in-memory (PIM) architectures can benefit from high memory bandwidth with reduced data movement energy. Despite this benefit, potential limitations on PIM compute capabilities may introduce complex trade-offs between performance and energy when scheduling execution on various resources. For this reason, Pattnaik et al. [24] proposed an approach using a regression model to classify core affinity, thus dividing the workload, and an additional regression model to predict execution time, enabling dynamic task migration. Performance and energy efficiency are improved by 42% and 27%, respectively, over a baseline GPU architecture. Further improvements are possible by improving core affinity classification accuracy (compared to regression).
Caches: Heuristic approaches for caching can incur performance penalties due to dramatic workload variance. ML approaches can learn these intricacies and offer superior performance. Peled et al. [25] proposed a prefetcher exploiting semantic locality (data structures) using contextual bandits (a simple RL variant), correlating contextual information and candidate addresses for prefetching. Implementation uses a two-level indexing method to dynamically control state information, allowing online feature selection with some additional overhead. Zeng and Guo [26] proposed a long short-term memory (LSTM) model (a recurrent neural network variant) for prefetching based on local history and offset-delta tables. Evaluation showed that the LSTM model enables accurate predictions over longer sequences and with higher noise resistance than prior work. Several concerns relating to overhead and warm-up time are addressed, with potential solutions remaining for future work. Similarly, Braun et al. [27] extensively explored LSTM prefetching accuracy under several common access patterns. Experiments considered the impact of lookback size (access history window) and LSTM model size for several noise levels and independent access stream counts. Recent work by Bhatia et al. [28] synthesized traditional prefetchers with a perceptron-based prefetch filter, allowing aggressive predictions without degrading accuracy. Evaluation confirmed substantial coverage and IPC benefits offered by the proposed scheme, with 9.7% IPC speedup over the next best prefetcher when referenced to a no-prefetching four-core baseline. ML has similarly been applied to data reuse policies. For example, Teran et al. [29] predicted LLC reuse with a perceptron model. In this approach, input features are hashed to access saturating weight tables that are incremented/decremented based on correct/incorrect reuse prediction. These features are chosen empirically and shown to significantly impact performance, thus presenting an option for further optimization. Wang et al. [30] predicted reuse prior to cache entry, only storing data in the cache if there was predicted reuse. They used decision trees as a low-cost alternative to ensemble models, achieving a 60-80% reduction in writes. Additional research has explored the growing performance bottleneck in translation lookaside buffers (TLBs). Margaritov et al. [31] proposed a scheme for virtual address translation in TLBs based on learned indices [32]. Evaluation showed nearly 100% accuracy for predicted indices, but practical implementation will require dedicated hardware to reduce calculation overhead (and is left for future work).
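To ground the LSTM-based prefetching idea, the sketch below trains a small PyTorch LSTM to predict the next address delta from a window of recent deltas; the delta vocabulary, window size, and synthetic access stream are illustrative assumptions, not the configuration of [26] or [27].

```python
# Minimal sketch: LSTM next-delta prediction for prefetching (illustrative setup).
import torch
import torch.nn as nn

NUM_DELTAS, WINDOW = 16, 8          # assumed delta vocabulary and history window

class DeltaLSTM(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(NUM_DELTAS, 32)
        self.lstm = nn.LSTM(32, 64, batch_first=True)
        self.head = nn.Linear(64, NUM_DELTAS)    # classify the next delta

    def forward(self, x):
        out, _ = self.lstm(self.embed(x))
        return self.head(out[:, -1, :])          # prediction from the last time step

# Synthetic access stream: a repeating delta pattern.
stream = torch.tensor([1, 2, 4, 1, 2, 4, 8, 3] * 64, dtype=torch.long) % NUM_DELTAS
xs = torch.stack([stream[i:i + WINDOW] for i in range(len(stream) - WINDOW)])
ys = stream[WINDOW:]

model = DeltaLSTM()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()
for _ in range(50):                               # brief training loop
    opt.zero_grad()
    loss = loss_fn(model(xs), ys)
    loss.backward()
    opt.step()

# The predicted next delta for the latest history window drives the prefetch address.
print(model(xs[-1:]).argmax(dim=1))
```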
Schedulers & Control: Controllers for memory and storage systems influence both device performance and reliability, thus representing another strong application for ML models compared with heuristics. Ipek et al. [33] first proposed an RL approach for memory controllers to capture the balance between concurrency, delay, and several other factors. The model predicted optimal actions (precharge, activate, row read/write), improving system performance by 15% (in a two-channel system) over prior work. Mukundan and Martinez [34] later built upon Ipek’s work, generalizing the reward function to optimize energy, fairness, etc. They also added power-up and power-down actions to enable a further 8.6% improvement in performance and a significant improvement in energy efficiency. Related work optimizes communication energy between memory/storage and other systems using ML. Manoj et al. [35] proposed a Q-learning method for dynamic voltage swing control in through-silicon-interposer transmission lines. Predictions for power and bit error rate were quantized, then provided as input to the model to determine a new voltage level. Although their approach requires significant quantization to minimize overhead, they still achieved 15.1% energy savings compared to a static voltage baseline. Wang and Ipek [36] reduced data movement energy through online clustering and encoding. Several clusters are continuously updated at a bit-level using majority voting for data in that cluster. The total number of transmitted 1s is then minimized by XORing new data with the closest learned cluster center. Kang and Yoo [37] applied Q-learning to manage garbage collection in SSDs by determining optimal periods of inactivity. Key states are kept in the Q-table using LRU replacement, allowing a vast state space and, ultimately, a 22% average tail latency reduction over the baseline. Many states are, however, observed only once per workload, suggesting potential benefits using deep Q-learning (DQL). Other work directly considered system reliability. For example, Deng et al. [38] proposed a regression-based framework to dynamically optimize performance and lifetime in non-volatile memories. Their approach used phase-based application statistics to manage several conflicting policies for write latency, write cancellation, endurance, etc., guaranteeing a minimum lifetime with modest performance/energy improvements. Xiao et al. [39] proposed a method for disk failure prediction using an online random forest. They trained their model using a disk status window to account for imprecision in recorded failure dates, enabling accurate predictions of soon-to-be faulty drives. Comparison against other random forest updating schemes (e.g., updating once a month) highlighted accuracy benefits from consistent training that may be extended to related domains.
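A hedged sketch of the tabular Q-learning loop underlying controllers such as [33], [34], [37]: the state encoding (queue-occupancy buckets), action set, and reward below are simplified stand-ins for the much richer formulations in those papers.

```python
# Minimal sketch: tabular Q-learning for a toy command scheduler.
import random
from collections import defaultdict

ACTIONS = ["activate", "read", "write", "precharge"]    # simplified command set
ALPHA, GAMMA, EPSILON = 0.1, 0.9, 0.1

q_table = defaultdict(float)                             # (state, action) -> value

def choose_action(state):
    """Epsilon-greedy selection over the learned action-values."""
    if random.random() < EPSILON:
        return random.choice(ACTIONS)
    return max(ACTIONS, key=lambda a: q_table[(state, a)])

def update(state, action, reward, next_state):
    """Standard Q-learning update toward the bootstrapped target."""
    best_next = max(q_table[(next_state, a)] for a in ACTIONS)
    q_table[(state, action)] += ALPHA * (reward + GAMMA * best_next
                                         - q_table[(state, action)])

# Example interaction step: state = (pending-reads bucket, pending-writes bucket).
state = (2, 1)
action = choose_action(state)
reward = 1.0 if action == "read" else 0.0               # stand-in for serviced requests
update(state, action, reward, next_state=(1, 1))
```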
Branch Prediction: Branch prediction is a noteworthy example of current ML application in industry, with accuracy surpassing prior state-of-the-art non-ML predictors. The perceptron-based branch predictor was first proposed by Jiménez and Lin [40] as a promising high-accuracy alternative to two-level schemes using pattern history tables. Later research by St. Amant et al. introduced SNAP [41], a perceptron-based predictor implemented using analog circuitry to enable an efficient and practically feasible design. Perceptron weights and branch history were used to drive current-steering DACs that perform the dot product as the sum of currents. Jiménez [42] further optimized this design using a per-branch history table, dynamic coefficients for history importance, and a dynamic learning threshold. The optimized design achieves 3.1% lower MPKI than L-TAGE. Recent work with perceptron-based predictors by Garza et al. [43] explored bit-level prediction for indirect branches. Possible branch targets are evaluated using their similarity (dot product) with the combined weights from eight feature tables incorporating local and global history, ultimately reducing MPKI by 5% compared to ITTAGE. Currently, state-of-the-art conditional branch predictors (e.g., TAGE-SC-L [44]) still hide significant IPC gains (14.0% for an Intel Skylake architecture) in just a few hard-to-predict (H2P) conditional branches [45]. Tarsa et al. [45] consequently proposed “CNN Helper” predictors that target specific H2Ps using simple two-layer CNNs. Results indicate strong applicability across diverse workloads and present a promising area for future work.
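The core perceptron predictor mechanism from [40] can be sketched in a few lines: a signed weight vector is dotted with the (±1-encoded) global history, the sign gives the prediction, and weights are updated on a misprediction or when the output magnitude falls below a training threshold. Table sizing and the threshold below are illustrative choices.

```python
# Minimal sketch: perceptron branch prediction over global history (illustrative sizes).
import numpy as np

HISTORY_LEN, NUM_PERCEPTRONS, THRESHOLD = 16, 256, 30

weights = np.zeros((NUM_PERCEPTRONS, HISTORY_LEN + 1), dtype=np.int32)  # +1 for bias
history = np.ones(HISTORY_LEN, dtype=np.int32)        # +1 taken, -1 not taken

def predict(pc):
    w = weights[pc % NUM_PERCEPTRONS]
    y = w[0] + int(np.dot(w[1:], history))             # bias + dot product with history
    return y >= 0, y

def train(pc, taken, y):
    """Update on misprediction or weak output, then shift the outcome into history."""
    global history
    t = 1 if taken else -1
    if (y >= 0) != taken or abs(y) <= THRESHOLD:
        w = weights[pc % NUM_PERCEPTRONS]
        w[0] += t
        w[1:] += t * history
    history = np.roll(history, 1)
    history[0] = t

# Example: process one branch outcome.
taken_pred, y_out = predict(pc=0x400123)
train(pc=0x400123, taken=True, y=y_out)
```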
DVFS & Link Control: Modern computing systems exploit complex power control schemes to enable increasingly parallel architectural designs. Heuristic schemes may fail to exploit all energy-saving opportunities, particularly in dynamic network-on-chip (NoC) workloads, leading to significant benefits through proactive ML-based control. Savva et al. [46] implemented dynamic link control using several ANNs, each of which monitors a NoC partition. These ANNs used just link utilization to learn a dynamic threshold to enable/disable links. Despite energy savings, their approach can cause high latency under dimension-ordered routing. DiTomaso et al. [47] relocated flit buffers to the links and dynamically controlled both link direction and power-gating with per-router classification trees. Using a simple three-level tree to limit overhead, overall NoC power is reduced by 85% and latency is reduced by 14% compared to a concentrated mesh. Winkle et al. [48] explored ML-based power scaling in photonic interconnects. Even a simple linear regression model provided promising results, negligibly reducing throughput (versus no power-gating) while reducing laser power consumption by 42%. Reza et al. [49] proposed a multi-level ANN control scheme that considered both power and thermal constraints on task allocation, link allocation, and node DVFS. Individual ANNs classified appropriate configurations for local NoC partitions while a global ANN classified optimal overall resource allocation. This scheme identifies the global optimal NoC configuration with high accuracy (88%), but uses complex ANNs that could impact implementation. Clark et al. [50] proposed a router design for DVFS and evaluated several regression-based control strategies. Variants predicted buffer utilization, change in buffer utilization, or a combined energy and throughput metric. This work was expanded by Fettes et al. [51], who introduced an RL control strategy. Both regression and RL models enable beneficial tradeoffs, although the RL strategy is most flexible.
Admission & Flow Control: As with NoC DVFS, both admission and flow control can benefit from proactive prediction. Early work by Boyan and Littman [52] introduced Q-learning based routing in networks using delivery time estimates from neighboring nodes, noting throughput advantages over traditional shortest path routing for high traffic intensity. Several works have expanded upon Q-routing, observing application in dynamically changing NoC topologies [53], improved capabilities in bufferless NoC fault-tolerant routing [54], and high-performance congestion-aware non-minimal routing [55]. More recent works have instead focused on injection throttling and hotspot prevention. For example, Daya et al. [56] proposed SCEPTER, a bufferless NoC using single-cycle multi-hop paths. They controlled injection throttling using Q-learning to maximize multi-hop performance and improve fairness by reducing contending flits. Future work could reduce Q-table overhead, which scales with NoC size in their implementation. Wang et al. [57] instead used an ANN to predict optimal injection rates for a standard buffered NoC. Additional pre-processing (to capture both spatial and temporal trends) and node grouping enables high accuracy predictions (90.2%) and reduces execution time by 17.8% compared to an unoptimized baseline. Soteriou et al. [58] similarly explored ANN-based injection throttling to reduce NoC hotspots. The ANN was trained to predict hotspots while recognizing the impact of proposed injection throttling and dynamic routing, providing a holistic mitigation strategy. The model provides state-of-the-art results for throughput and latency under synthetic traffic, but limited improvement under real-world benchmarks, suggesting the potential for further optimization. Another Q-learning approach, proposed by Yin et al. [59], used DQL to arbitrate NoC traffic. They considered a wide range of features and rewards while noting that the proposed DQL algorithm is impractical due to overhead. Regardless, evaluation exhibited modest throughput improvements over round-robin arbitration.
Topology & General Design: Several works also applied ML to higher-level NoC topology design, involving trade-offs between power and performance, with some further considering thermals. Das et al. [60] used an ML-based
STAGE algorithm to efficiently explore small-world inspired 3D NoC designs. In this approach, design alternates between base/local search (adding/removing links in a hill-climbing approach) and meta search (predicting beneficial starting points for local search using prior results). The same model was used again by Das et al. [61] to balance link utilization and address TSV reliability concerns. The STAGE algorithm was then enhanced by Joardar et al. [62] to optimize a heterogeneous 3D NoC design. The model explores multi-objective trade-offs between CPU latency, GPU throughput, and thermal/energy constraints. All three works [60], [61], [62] still rely upon hill-climbing for optimization. Recent work by Lin et al. [63] instead explored deep reinforcement learning in routerless NoC design. They used a Monte Carlo tree search to efficiently explore the search space and a deep convolutional neural network to approximate both the action and policy functions, thereby optimizing loop configurations. Further, the proposed deep reinforcement learning framework can strictly enforce design constraints that may be violated by prior heuristic or evolutionary approaches. Rao et al. [64] investigated multi-objective NoC design optimization across a broad SoC feature space (from bandwidth requirements to SoC area). ML models were trained using data from thousands of SoC configurations to predict optimal NoC designs based on performance, area, or both. Limited comparisons against human-expert designs did not consider alternative techniques (e.g., AMOSA [65]), yet exhibited some promising results, motivating research into effective features and models as well as further comparisons against alternative techniques.
Performance Prediction: Existing NoC models based on queuing theory are generally accurate, but rely on assumptions of traffic distribution that may not hold for real applications [66]. Qian et al. [66] emphasized how ML-based approaches can relax the assumptions made by queueing theory models. They constructed a mechanistic-empirical model based on a communication graph, using support vector regression (SVR) to relate several features and queuing delays. Evaluation showed lower error (3% error vs 10% error) than an existing analytical approach. Sangaiah et al. [67] considered both NoC and memory configuration for performance prediction and design space exploration. Following a standard approach, they sampled a small portion of the design space, then trained a regression model to predict the resulting system CPI. Evaluation generally showed high accuracy, but lower accuracy for high-traffic workloads (median error of 24%). Additional design space exploration exhibited promising results, reducing the design space from 2.4M points to less than 1000.
Reliability & Error Correction: Overhead introduced by error correction in NoCs can be significant, especially when re-transmission is required. Several works have, therefore, explored ML-based control schemes. DiTomaso et al. [68] trained a decision tree to predict NoC faults using a wide range of parameters including temperature, utilization, and device wear-out. These predictions allow proactive encoding (on top of the baseline cyclic redundancy check) for transmissions that are likely to have errors. Wang et al. [69] adopted a similar strategy for dynamic error mitigation, but used an RL-based control policy to eliminate the need for labeled training examples. Their approach provides an average of 46% dynamic power savings (17% better than the decision tree method [68]) compared with a static CRC scheme. In both cases, ML-based proactive control chose a more efficient scheme than CRC only. Wang et al. [70] subsequently proposed a holistic framework for NoC design incorporating dynamic error mitigation, router power-gating, and multi-function adaptive channel buffers (MFAC buffers). They emphasized comprehensive benefits through synergistic integration/control of several architectural innovations, thus achieving substantial improvements in latency (32%), energy-efficiency (67%), and reliability (77% higher Mean Time to Failure) compared to a SECDED baseline.
Energy Efficiency Optimization: Significant work has begun to consider systems in which workload execution is constrained by total energy consumption rather than processing resources. Control schemes incorporating ML have shown promise in optimizing energy efficiency with minimal performance reduction, often enabling 60-80% reductions in the energy-delay product compared to race-to-idle schemes. Won et al. [71] introduced a hybrid ANN + PI (proportional-integral) controller scheme for uncore DVFS. They initially trained the ANN offline, then refined predictions online using the PI controller. This hybrid scheme was shown to reduce the energy-delay product by 27% compared to a PI controller alone, with less than 3% performance degradation compared to the highest V/F level. Pan et al. [72] implemented a power management scheme using a multi-level RL algorithm. Their method propagates individual core states up a tree structure while aggregating Q-learning representations at each level. Global allocation is made at the root, then decisions are propagated back down the tree, enabling efficient per-core control. Bailey et al. [73] addressed power efficiency in heterogeneous systems. Similar to Wu et al. [16], they clustered kernels by their scaling behavior to train multiple linear regression models. Runtime prediction used two sample configurations, one from CPU execution and one from GPU execution, to determine the optimal configuration. Lo et al. [74] focused on energy-efficiency optimization for real-time interactive workloads. They used linear regression to model execution time based on annotations and code features, enabling stricter service level guarantees at the cost of applicability when source code is unavailable. Mishra et al. [75] also addressed real-time workloads, combining control theory and several ML-based models. Their framework was realized by offloading learning to a server, allowing low overhead DVFS that reduces energy consumption by 13% compared to the best prior approach. Related work by Mishra et al. [2] applied a comparatively complex hierarchical Bayesian model to combine both offline and online learning. In this approach, they accepted a high execution time penalty (0.8s) in order to provide significantly more accurate predictions than online or offline training alone. This approach therefore targeted longer executing workloads, but can provide more than 24% energy savings over the next best approach. Bai et al. [76] implemented an RL-based DVFS control policy adapted to a novel voltage regulator hierarchy using off-chip switching regulators and on-chip linear regulators. Individual RL agents adapt to a dynamically allocated power budget determined by a heuristic bidding approach. The design was enhanced using adaptive Kanerva coding [77] to limit area/power overhead and experience sharing to accelerate learning. Chen and Marculescu [78] (later Chen et al. [79]) explored an alternative two-level strategy for RL-based DVFS. Similar to Bai et al. [76], they used RL agents at a fine-grain core level to select a V/F level based on an allocated share of the global power budget. They achieved further improvement by allocating power budget using a performance-aware, albeit still heuristic-based, variant that considers relative application performance requirements. Imes et al. [80] explored single-application system energy optimization for a broader range of configuration options including socket allocation, HyperThread usage, and processor DVFS.
They identified several useful models, while noting that further work could optimize models and parameters. Analysis also provided insight into the benefit from single-model multi-resource optimization, particularly for neural networks. Finally, recent work by Tarsa et al. [81] considered an ML framework for post-silicon CPU adaptations using firmware updates to microcontroller-implemented models. Significant accommodations for statistical blindspots limit the rate of service-level-agreement violations while optimizing performance per watt for both general-purpose and application-specific deployment.
Task Allocation and Resource Management: In addition to energy control, ML offers an approach to allocate resources to tasks or tasks to resources by predicting the impact of various configurations on long-term performance. Lu et al. [82] proposed a thermal-aware Q-learning method for many-core task allocation. The agent considered only current temperature (i.e., no application profiling or hardware counters), receiving higher rewards for task assignments resulting in greater thermal headroom. Evaluation indicated an average 4.3°C reduction in peak temperature compared to a heuristic approach. Nemirovsky et al. [83] introduced a method for IPC prediction and task scheduling on a heterogeneous architecture. They predicted IPC for all task arrangements using ANNs, then selected the arrangement with the highest IPC. Evaluation highlighted significant throughput gains using a deep (but high overhead) neural network, indicating one possible application for pruning (discussed in Section 5.2). Recent work has also explored multi-level scheduling in hybrid CPU-GPU clusters. Zhang et al. [84] proposed a deep reinforcement learning (DRL) framework to divide video workloads, first at the cluster level (selecting a worker node) and then at the node level (CPU vs GPU). The two DRL models act separately, but still work together to optimize overall throughput. Allocating resources to tasks is another possible approach. Early work by Bitirgen et al. [85] considered a system with four cores and four concurrent applications. In their approach, per-application ANN ensembles predicted IPC for 2,000 configurations at each interval (500K cycles). IPC predictions were then aggregated to choose the highest performing overall system configuration. Scaling concerns for per-application ensembles and exponentially increasing configuration spaces could be addressed in future work. Recent research has also considered low-level co-optimization involving multiple components/resources. For example, Jain et al. [86] explored concurrent optimization of core DVFS, uncore DVFS, and dynamic LLC partitioning. These options are optimized by individual agents (potentially limiting co-optimization opportunities) at a relatively large interval (1B instructions). Evaluation nevertheless indicated noteworthy reductions in energy-delay-product through multi-resource optimization. Finally, work by Ding et al. [87] established a somewhat contradictory trend between model accuracy and system optimization goals based on improvements for data scarcity and model bias. Specifically, they found that state-of-the-art models exhibit diminishing returns for accuracy and instead benefit from domain knowledge (e.g., focus sampling on the optimal front).

Chip Layout: Work by Wu et al. [88] demonstrated uses for ML in chip layout, deviating from the common applications including control, prediction, and design space exploration. They used k-means to cluster flip-flops during physical layout, minimizing clock wirelength at the expense of signal wirelength, noting that clock networks can consume more than 40% of chip power. They included constraints on maximum flip-flop displacement and cluster size, generating designs with 28.3% reduced displacement, 3.2% reduced total wirelength, and 4.8% reduced total switching power compared to the prior state-of-the-art approach.
Security: Malware detection, a traditionally software-based task, has been explored using machine learning to enable reliable hardware-based in-execution detection. For example, Ozsoy et al. [89] tested both logistic regression (LR) and neural network classifiers trained on low-level hardware counters. Optimization based on reduced precision and feature selection provides high accuracy (100% malware detection and less than 16% false positives) with minimal overhead (0.04% core power and 0.19% core logic area) for the LR model.
Approximate computing has many facets, including circuit level approximation (such as reduced precision adders), control level approximation (relaxing timings, etc.), and data level approximation. Methods using ML generally fall within this last category, offering a powerful function/loop approximation technique that commonly provides 2-3 times application speedup and energy reduction with limited impact on output quality. Esmaeilzadeh et al. [90] introduced the NPU, a new approach to programmable approximation using neural networks. They developed a framework to realize Parrot transformations that translate annotated code segments into neural network approximations. Tightly integrating the NPU with the CPU allowed an average 2.3x speedup and 3.0x energy reduction in studied applications. This framework was later extended by Yazdanbakhsh et al. [91] to implement neural approximation on GPUs. Neural approximation was integrated into the existing GPU pipeline, enabling component re-use and approximately 2.5x speedup and 2.5x reduced energy. Grigorian et al. [92] presented a different approach for a multi-stage neural accelerator. Inputs are first sent through a relatively low accuracy/overhead neural accelerator, then checked for quality; acceptable results are committed, while low quality approximations are forwarded to an additional, more precise, approximation stage. The problem with these works is that error is either constant [90], [91] or requires several stages with potentially redundant approximation [92]. For that reason, Mahajan et al. [93] introduced MITHRA, a co-designed hardware-software control framework for neural approximation. MITHRA implements configurable output quality loss with statistical guarantees. ML classifiers predict individual approximation error, allowing comparison to a quality threshold. Recent work by Oliveira et al. [94] also explored approximation using low-overhead classification trees. Even with software-based execution, they achieved application speedup comparable to an NPU [90] hardware implementation. Finally, ML has also been used to mitigate the impact of faults in existing approximate accelerators. Taher et al. [95] observed that faults tend to manifest in a similar manner across many input test vectors. This observation enables effective error compensation using a classification/regression model to correct output based on predicted faults for a given input.
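The Parrot-style idea of replacing an annotated, approximable function with a small neural network can be sketched as below; the target function, network size, and quality check are illustrative choices rather than the actual NPU [90] design.

```python
# Minimal sketch: approximating an annotated function with a small neural network.
import numpy as np
from sklearn.neural_network import MLPRegressor

def annotated_kernel(x):
    """Stand-in for an approximable code region (e.g., part of a filter kernel)."""
    return np.sin(x[:, 0]) * np.exp(-x[:, 1] ** 2)

# Collect input/output examples by instrumenting the original function.
X = np.random.uniform(-2, 2, size=(5000, 2))
y = annotated_kernel(X)

# Train a compact MLP that can stand in for the original computation.
approx = MLPRegressor(hidden_layer_sizes=(8, 8), max_iter=2000, random_state=0).fit(X, y)

# Quality check on held-out inputs before accepting the approximation.
X_test = np.random.uniform(-2, 2, size=(1000, 2))
error = np.mean(np.abs(approx.predict(X_test) - annotated_kernel(X_test)))
print(f"mean absolute error of the approximation: {error:.4f}")
```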
4 ANALYSIS OF CURRENT PRACTICE
This section examines varying techniques employed in existing work. These comparisons emphasize potentially useful design practices and strategies for future work.

Work is divided into two categories that represent a natural division in design constraints and operating timescales and therefore correspond to differing design practices. The first category, online ML application, encompasses work that directly applies ML techniques at run-time, even if training is performed offline. Design complexity in this work is therefore inherently limited by practical constraints such as power, area, and real-time processing overhead. The second category, offline ML application, instead applies ML to guide architectural implementation, involving tasks such as design and simulation. Consequently, models for offline ML application can exploit higher complexity and higher overhead options at the cost of training/prediction time.
Model Selection: Online ML applications primarily use either decision trees or ANNs, in the case of supervised learning models, and either Q-learning or deep Q-learning, in the case of RL models. Note that tasks for these learning approaches are not necessarily disjoint, particularly for control. Fettes et al. [51] cast DVFS as both a supervised learning regression task and as a reinforcement learning task. The supervised learning approach predicted buffer utilization or change in buffer utilization to determine an appropriate DVFS mode. In contrast, the RL approach directly used DVFS modes as the action space. Both models can perform well, but the RL model is more universally applicable since the energy/throughput trade-off can be tailored to application needs and does not require threshold tuning. This certainly does not mean that RL is a one-model-fits-all solution. Supervised learning models find strong application in function approximation [90], [91], [92], [94] and branch prediction tasks [41], [42], which are far less suitable (if not impossible) to approach using RL since these tasks cannot be represented well as a sequence of actions.
Implementation & Overhead: Implementation of online ML applications highlights limitations in data availability, storage space for models, etc., indicating the need for an efficient, and generally low complexity, model. These limitations will likely become more important to consider as more research moves towards real-world implementation.

Several NoC-based works [46], [56], [71] have applied different methods for global data collection to support ML models. Daya et al. [56] implemented self-learning injection throttling using a separate bufferless starvation network that carries a starvation flag, encoded as a one-hot N-bit vector for a network with N nodes. These starvation vectors are propagated to all nodes, allowing individual node-based Q-learning agents to determine appropriate injection throttling. Soteriou et al. [58] similarly used a dedicated network to collect buffer utilization and VC occupancy statistics. The ANN-based DVFS control proposed by Won et al. [71] eschewed an additional status/data network by encoding data into unused bits in standard packet headers. Data is opportunistically collected by a central control unit as packets pass through its router. This method introduces potential concerns about data staleness, but prior work [96] observed nearly identical performance to omniscient data collection for sufficiently large (50K cycle) control windows. Smaller time windows can be accommodated by sending dedicated packets, as done by Savva et al. [46].

Implementation can also consider the use of either hardware or software models. Implementation using dedicated hardware will usually experience lower execution time overhead, but there are other considerations. Esmaeilzadeh et al. [90] implemented a neural processor (NPU) for function approximation using a dedicated hardware module. They also considered a software implementation, but observed a prohibitive increase in instruction count for software execution compared to a baseline x86 function. Later work by Oliveira et al. [94] found that function approximation using a simple classification tree can achieve comparable results to the NPU [90] for application speedup and error rate in several applications (albeit somewhat worse on average). Their purely software implementation highlights a trade-off between area/power and accuracy/performance. Won et al. [71] observed a similar trade-off, choosing to implement an ANN in software using an on-die microcontroller rather than dedicated hardware. This implementation consumes several orders of magnitude more cycles (15K cycles for inference), but requires 50mW less average power than a hardware implementation.

Approaches for hardware implementation may also vary based on the task. A “standard” ANN implementation is observed in work by Savva et al. [46]. They incorporated a finite state machine for control, an array of multiply-accumulate (MAC) units for calculation, a register array to load and store results, and a lookup-table-based activation function. Both MAC array width and calculation precision can be adjusted to balance power/area and accuracy/speed. In contrast, St. Amant et al. [41] implemented a perceptron branch predictor using a mixed signal design. They realized dot products in analog circuitry, leveraging transistor sizing and current summing to achieve a feasible overhead. Variance also exists in hardware for RL models. The “standard” Q-learning implementation requires a lookup table to store state-action values. Ipek et al. [33] as well as Mukundan and Martinez [34] instead used
CMAC [97], replacing a potentially extensive Q-learning table with multiple coarse-grain overlapping tables. This approach also included hashing, using hashed state attributes to index the CMAC tables. Taken together, these two methods balance generalization and overhead, although they may introduce collisions/interference depending on the task. Further pipelining the hashing, CMAC table lookup, and calculation allows more possible action-values to be evaluated per cycle.
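A hedged sketch of the CMAC-style approximation described above: the state is hashed into several coarse, overlapping tables whose entries are summed to estimate a Q-value. The table count, sizes, and hash are illustrative, not the parameters of [33] or [34].

```python
# Minimal sketch: CMAC-style coarse-coded Q-value estimation with hashed indexing.
import numpy as np

NUM_TABLES, TABLE_SIZE, ALPHA = 4, 1024, 0.1
tables = np.zeros((NUM_TABLES, TABLE_SIZE))

def indices(state, action):
    """Hash (state, action) into one entry per coarse table, with per-table offsets."""
    return [hash((tuple(state), action, t)) % TABLE_SIZE for t in range(NUM_TABLES)]

def q_value(state, action):
    """The estimated Q-value is the sum of the overlapping table entries."""
    return sum(tables[t, i] for t, i in enumerate(indices(state, action)))

def update(state, action, target):
    """Distribute the correction toward the target evenly across the tables."""
    delta = ALPHA * (target - q_value(state, action)) / NUM_TABLES
    for t, i in enumerate(indices(state, action)):
        tables[t, i] += delta

# Example: one update for a quantized controller state and an action index.
update(state=(3, 1, 0), action=2, target=1.0)
print(q_value((3, 1, 0), 2))
```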
Optimization: Online ML applications with online training benefit from adaptivity to run-time workload characteristics. Despite these benefits, low model accuracy can negatively impact system performance, most notably at the start of execution or during periods of high variance in workload characteristics. Adaptations to control and learning can be considered to avoid these detrimental impacts. Some RL-based work [25] considered mitigating the impact of poor actions during exploration by introducing “shadow” operations. These operations are low confidence actions suggested by the model that are still used in model updates but not executed by the system. Consequently, the model gains feedback on the goodness of the action without negatively impacting the system. In a supervised learning based control task, Won et al. [71] trained an ANN online using control actions made by a PI controller, which exhibits far less start-up delay. Following training, control decisions are made using a hybrid combination based on error and consistency, allowing complementary control. In the simplest case, checking the performance of a default configuration, as in [38], provides a guarantee that the ML model will not perform worse than the default, but can perform better.

In most works, ML models replace existing approaches (commonly a heuristic). Nevertheless, several recent works [28], [45] have demonstrated significant advantages by combining both traditional (non-ML) and ML approaches. These improvements are derived from the orthogonal prediction/decision-making capabilities of the two approaches, thus enabling synergistic performance improvements. This method can also enable lower-cost ML application by focusing on particular shortcomings in traditional approaches. Both recent works [28], [45] consider just branch prediction, thus significant opportunities exist to explore this potential co-design paradigm.
Model/Feature Selection: Offline ML applications generally exhibit substantial model/feature diversity since the model itself is not tied to a particular architecture. Model and feature selection therefore focuses more on maximizing model accuracy while minimizing overall learning/prediction time. Design space exploration, in particular, can be approached using either iterative search methods for direct optimization or supervised learning methods to select optimal points based on the predicted optimality of a design. Several works [60], [61], [62] used an iterative
STAGE [98] algorithm that optimizes local search for 3D NoC links by learning an evaluation function to predict local search results from a given starting point. Recent work has instead applied deep reinforcement learning [63] to routerless NoC design. The proposed Monte Carlo tree search, along with actions suggested by a convolutional neural network, provides a highly efficient search process. Parallel threads are also utilized to scale design space exploration with increasing computational resources. System-level design space exploration has favored more standard supervised learning approaches [17], [64], [67]. Specific model choices vary, with linear [17], [64] and non-linear [67] regression models, as well as random forests and neural networks [64], finding implementation. As in online ML applications, discussed in Section 4.1, some tasks are naturally limited to supervised learning methods. Cross-architecture prediction is an exemplar [12], [13], [15], [19], [20].
Optimization : The usefulness of an ML model in offline ML applications is largely determined by overhead relative to traditional design approaches. Optimization therefore primarily focuses on improving data efficiency and overall model accuracy.
Ensemble methods have been proposed in online ML applications [38], but primarily find application in offline ML applications as ensembles can be made arbitrarily large (relative to available computation resources). Several optimizations have been suggested to improve efficiency. Jooya et al. [17] trained many neural networks using slightly different configurations and generated an ensemble using a subset of the models that generalized well and were most insensitive to input noise. They further introduced outlier detection by filtering predictions whose performance and/or power predictions differ greatly from the closest configuration in training data. Ardalani et al. [20] instead kept all 100 models that they trained, noting that models may be very strong predictors in one application but weak predictors in another. They remedied this dilemma by selecting only the 60 individual predictions closest to the median prediction.
Sampling method optimization, while not unique to architecture tasks, is nevertheless important to consider in improving model accuracy. Sangaiah et al. [67] considered potential systematic biases in their uncore performance prediction model. Specifically, they observed that uniform random sampling may not adequately capture performance relationships in a non-uniform configuration space (as in cache configurations using powers of two for sizing). They therefore used a low-discrepancy sampling technique,
SOBOL [99], to remove this systematic bias and prevent performance over-prediction for low-end configurations.
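As one way to realize this kind of low-discrepancy sampling, the sketch below draws a Sobol sample with SciPy's quasi-Monte Carlo module (scipy.stats.qmc, available in SciPy 1.7+) over an illustrative uncore-like space. Sampling cache parameters in the log2 domain is an assumption made here to keep power-of-two sizes evenly represented; the parameter ranges are not those of the cited work.

```python
import numpy as np
from scipy.stats import qmc

# Low-discrepancy Sobol sample over a hypothetical 3-parameter space:
# (log2 LLC size, log2 associativity, frequency bin).
sampler = qmc.Sobol(d=3, scramble=True, seed=0)
unit_points = sampler.random_base2(m=6)            # 2^6 = 64 points in [0, 1)^3

lows  = np.array([10, 1, 0])                        # 1 KB, 2-way, bin 0 (illustrative)
highs = np.array([24, 5, 7])                        # 16 MB, 32-way, bin 7 (illustrative)
scaled = qmc.scale(unit_points, lows, highs)

llc_bytes = 2 ** np.round(scaled[:, 0]).astype(int)  # power-of-two cache sizes
ways      = 2 ** np.round(scaled[:, 1]).astype(int)
freq_bin  = np.round(scaled[:, 2]).astype(int)
```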
The powerful relationship learning capabilities offered by ML algorithms enable black-box implementation in many tasks (i.e., without consideration for task-specific characteristics), but may fail to capitalize on additional domain knowledge that could improve interpretability or overall model performance. Additionally, in some applications, domain knowledge can help identify aberrant behavior and, again, improve overall model usefulness. These themes are highlighted in several specific works, but are generally applicable for ML applied to architecture.
One approach uses mechanistic-empirical models, synthesizing a domain-knowledge-based mechanistic framework with empirical ML-based learning for specific parameters. These models simplify implementation compared to purely mechanistic models [11], can avoid incorrect assumptions made in purely mechanistic models [66], and can offer higher accuracy than purely empirical models by avoiding overfitting [11]. Eyerman et al. [11] also demonstrated how these models can be used to construct CPI stacks, allowing meaningful alternative design comparisons.
Deng et al. [38], in their work predicting optimal NVM write strategies, presented a case for tuning ML models based on task-specific domain knowledge. Following initial analysis, they discovered how a single configuration parameter (wear quota) can result in higher complexity and sub-optimal prediction accuracy for IPC and system energy, even with quadratic regression and gradient boosting models. Excluding wear quota from the configuration space, then later applying it to the predicted optimal configuration, provided a 2-6% improvement in prediction accuracy. Ardalani et al. [20] similarly examined inherent imperfections in their learning model for cross-platform performance prediction. Some predictions can be easy for learning models and hard for humans, representing an ideal scenario for ML application; the converse can also be true. In both cases, ML application is strengthened by considering task characteristics.
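The mechanistic-empirical idea can be illustrated with a toy example: the functional form of the model is fixed by domain knowledge, and only its free coefficients are fit from measurements. The simplified CPI form, the event features, and the stand-in data below are assumptions for illustration and are not the model of [11].

```python
import numpy as np
from scipy.optimize import curve_fit

# Mechanistic form (illustrative only): CPI is a base term plus miss events
# weighted by per-event penalties, which are fit empirically.
def cpi_model(events, base_cpi, branch_penalty, llc_penalty):
    branch_mpki, llc_mpki = events
    return base_cpi + branch_penalty * branch_mpki / 1000 + llc_penalty * llc_mpki / 1000

# Stand-in measurements: (branch MPKI, LLC MPKI) -> measured CPI.
branch_mpki = np.array([2.0, 5.0, 8.0, 12.0])
llc_mpki    = np.array([1.0, 3.0, 6.0, 10.0])
measured    = np.array([0.62, 0.95, 1.40, 1.98])

params, _ = curve_fit(cpi_model, (branch_mpki, llc_mpki), measured)
print(dict(zip(("base_cpi", "branch_penalty", "llc_penalty"), params)))
```

Because the fitted coefficients correspond to physically meaningful penalties, such a model supports interpretation (e.g., CPI stacks) in a way a purely empirical regressor does not.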
FUTURE WORK
This section synthesizes observations and analysis from Section 3 and Section 4 to identify opportunities and detail the need for future work. These opportunities may come at the model level, exploiting improved implementation strategies and learning capabilities, or at the application level, addressing the need for generalized tools or exploring altogether new areas.
Existing works generally apply ML at a single time-scale or level of abstraction. These limitations motivate investigation into models and algorithms that capture the hierarchical nature of architecture, both in terms of system design and execution characteristics.
Perform Phase-level Prediction : Application analysis using basic blocks [100] has long been a useful method for simulation, made possible by identifying unique and representative phases in program execution. Phase-level prediction offers analogous benefits for ML applied to architecture. A few recent works, in particular, have demonstrated promising results, with high accuracy for both performance prediction [13] as well as energy and reliability (lifetime) [38]. In general, most work [2], [17], [67] has not yet adopted phase-level prediction techniques (or does not explicitly mention its methodology). Specifically, future work could explore predictions for control and system reconfiguration based on phase-level behavior, rather than either static windows [85] or application-level behavior [75], [101].
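A minimal sketch of phase identification in the spirit of basic-block vector analysis [100] is shown below: per-interval basic-block execution frequencies are clustered, and one representative interval per cluster approximates a phase. The interval count, vector dimensionality, number of clusters, and the random stand-in data are assumptions for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans

# Each row is one execution interval's normalized basic-block frequencies
# (random stand-in data; real vectors would come from profiling).
rng = np.random.default_rng(0)
bbv = rng.random((200, 32))                 # 200 intervals x 32 basic blocks
bbv /= bbv.sum(axis=1, keepdims=True)

kmeans = KMeans(n_clusters=4, n_init=10, random_state=0).fit(bbv)

# One representative interval per phase: the member closest to its centroid.
representatives = []
for phase in range(kmeans.n_clusters):
    members = np.where(kmeans.labels_ == phase)[0]
    dists = np.linalg.norm(bbv[members] - kmeans.cluster_centers_[phase], axis=1)
    representatives.append(members[np.argmin(dists)])
print(representatives)
```

Phase-aware prediction or control would then attach a model (or a model input) to the phase label rather than to a fixed time window or the whole application.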
Exploit Nanosecond Scale : Coarse-grain ML, used in many DVFS control schemes, provides significant benefits over standard control-theory based schemes, yet fine-grain control can provide even greater efficiency. Specifically, analysis by Bai et al. [76] indicated very rapid changes in energy consumption, on the order of 1K instructions for some applications. Exploiting these brief intervals requires careful consideration for both the model and the algorithm. Future work may optimize existing algorithms such as experience sharing [102] and hybrid/tandem control [71], or consider approaches more suited for novel models (e.g., hierarchical models). These approaches could also enable additional nanosecond-scale co-optimization opportunities, such as dynamic LLC partitioning, to extract further efficiency gains.
Apply Hierarchical & Multi-agent Models : Application execution in computer systems naturally follows a hierarchical structure in which, at the top level, tasks are allocated to cores, then cores are assigned dynamic power and resource budgets (e.g., LLC space), and finally, at the bottom level, data/control packets are sent between cores and memory. Consequently, a single machine learning model may struggle to learn appropriate design/control strategies. Furthermore, in the case of reinforcement learning models, it can be exceedingly difficult to accurately assign credit to specific low-level actions based on their impact on overall execution time, energy efficiency, etc. One promising approach in recent work is hierarchical models [103]. Hierarchical reinforcement learning models enable goal-directed learning that is particularly beneficial in environments with sparse feedback (e.g., task allocation). Applying hierarchical learning to architecture could therefore enable more effective multi-level design and control. Multi-agent models are another promising area in machine learning research. These models tend to focus on problems in which reinforcement learning agents have only partial observability of their environment. Although partial observability may not be a primary concern in individual computer systems, recent work [104] has applied this concept to internet packet routing and demonstrated convergence benefits via improved cooperation between individual agents.
Increasingly complex models require effective strategies and techniques to reduce overhead and enable practical implementation. Model pruning and weight quantization, as discussed below, are two particularly effective techniques with proven benefits in accelerators, while many other promising approaches are also being explored [105].
Explore Model Pruning : Model complexity can be a limiting factor in online ML applications. A standard Q-learning approach requires a potentially extensive table to store action-values. Neural network based approaches for both RL (in Deep Q-Networks) and supervised learning require network weight storage and additional processing capabilities. Neural networks, in particular, are therefore generally constrained to a few layers in existing work, with many using just one hidden layer [46], [71], [85], [93] and some using one or two hidden layers [90], [91]. Recent research on neural networks has demonstrated promising methods to reduce model complexity through pruning [106], [107]. The general intuition is that many connections are unnecessary and can therefore be pruned. Iteratively pruning a high-complexity network, then retraining on the sparse architecture, achieves good results, with some work demonstrating very high sparsity.
Explore Quantization : Existing work primarily applies quantization to state values in Q-learning to enable practical Q-table implementation. Similarly, neural networks benefit from potential reductions in execution time, power, and area by reducing multiply-accumulator precision. Recent works, however, suggest a new spectrum of opportunities for alternative hardware implementations based on reduced-precision models. Binary neural networks, for example, quantize weights to be either +1 or -1, enabling computation based on bit-wise operations rather than arithmetic operations [108]. An additional approach considered quantizing neural network weights into finite (but non-binary) subsets in order to replace multiply operations with lookup-table accesses [109], allowing higher precision and lower execution time, albeit with potentially higher area cost. Future work on ML application can exploit similar hardware implementations while exploring optimal quantization levels for various tasks and control schemes.
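The two reductions discussed above can be sketched in a few lines of NumPy: magnitude-based pruning zeroes the smallest weights, and sign binarization maps weights to +1/-1. The one-shot pruning step, the 80% sparsity target, and the weight matrix below are illustrative assumptions; the cited works [106]-[108] embed these steps in full training pipelines.

```python
import numpy as np

def magnitude_prune(weights, sparsity=0.8):
    """Zero out the smallest-magnitude weights (illustrative one-shot step).

    Iterative prune-and-retrain loops, as in the cited pruning work, would
    repeat this between training rounds; the 0.8 target here is arbitrary.
    """
    threshold = np.quantile(np.abs(weights), sparsity)
    mask = np.abs(weights) > threshold
    return weights * mask, mask

def binarize(weights):
    """Quantize weights to +1/-1, as in binary neural networks."""
    return np.where(weights >= 0, 1.0, -1.0)

w = np.random.default_rng(0).normal(size=(64, 64))
w_pruned, mask = magnitude_prune(w, sparsity=0.8)
w_binary = binarize(w)
print(f"kept {mask.mean():.0%} of weights; binary values: {np.unique(w_binary)}")
```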
Existing machine learning tools (e.g., scikit-learn [110]) have proven useful for relatively simple ML applications. Nevertheless, complex design and simulation tasks require more sophisticated tools to enable rapid task-specific optimizations using general-purpose frameworks.
Enable Broad Application & Optimization : Purpose-built architectural tools, similar to heuristic design strategies, can be useful in enabling design, exploration, and simulation that satisfies a common use case. These approaches may still be limited in their application to a specific problem, optimization criteria, system configuration, etc. Given the fast-paced nature of architectural research (and machine learning research), there is a need to develop more generalized tools and easily modifiable frameworks to address broader applications and optimization options. ML-based design tools are especially promising, with recent works demonstrating successful application to immense design spaces (e.g., the routerless NoC design space in [63]). Opportunities for new design tools are not, however, limited to specific architectural components. Chip layout is a notable example in which even simple clustering algorithms can dramatically outperform existing heuristic approaches [88]. Future work can also continue to develop more broadly applicable tools for performance and power prediction. In particular, recent work on cross-platform performance prediction [21] suggests the possibility for high prediction accuracy with purely static features, thus representing another potential area for additional research.
Enable Widespread Usage : Generalized tools enable additional benefit by facilitating rapid design and evaluation. Using a machine learning approach, one might simply modify training data (in a supervised learning setting) or action/reward representation (in a reinforcement learning setting) rather than exploring models, data representation strategies, search approaches, etc., possibly without a priori machine learning experience. For example, recent work [63] envisioned reuse of a deep reinforcement learning framework for diverse NoC-related design tasks involving interposers, chiplets, and accelerators. While the framework might not be compatible with all work, especially in novel areas, it may provide a better foundation for machine learning application to architectures, especially for individuals with limited machine learning background.
Opportunities abound for future work to apply ML to both existing and emerging architectures, replace heuristic approaches to enable long-term scaling, and advance capabilities for automated design.
Explore Emerging Technologies : Several proposals [30], [37], [38], [39] establish how ML can be used to optimize both standard (energy, performance) and non-standard (lifetime, tail-latency) criteria. These non-standard criteria are shown to be particularly problematic in emerging technologies, as these technologies cannot easily find widespread implementation without some reliability guarantees. Applying ML to optimize both standard and non-standard criteria therefore provides a method for future work to intelligently balance control strategies dynamically, rather than relying upon a heuristic approach.
Explore Emerging Architectures : ML application to emerging architectures presents a similar benefit by enabling rapid development, even with limited best-practice knowledge, which may take time to develop. Work in long-standing design areas, such as task allocation and branch prediction, may incorporate best-practice domain knowledge to guide approaches, whether applying ML or some other traditional method. Best practices for emerging architectures may not be immediately obvious. For example, ML applications to 2D photonic NoCs [48], 2.5D processing-in-memory designs [24], and 3D NoCs [60], [61], [62] have all shown strong performance over existing approaches. Future work can explore ML application to novel concerns such as connectivity and reconfigurability in interposers and domain-specific accelerators.
Expand System-Level Approximate Computing : As discussed in Section 3.6, ML applications for approximate computing have been mostly limited to function approximation. However, there are many other facets of approximate computing that have already been implemented in non-ML works and could reap additional benefits by utilizing ML. For example, APPROX-NoC [111] reduces network traffic using approximated and encoded data. Another work explored a multi-faceted approximation scheme for a smart camera system [112] using approximate DRAM (lower refresh rate), approximate algorithms (loop skipping), and approximate data (lower sensor resolution). Existing compiler-based work [113] for system-wide approximation enhances prior capabilities to determine approximable code, but relies upon heuristic searches with representative inputs. Consequently, this method does not provide statistical guarantees, such as those in MITHRA [93]. Future work may explore searches based on deep reinforcement learning (or perhaps hierarchical reinforcement learning) to incorporate existing approximation techniques into a scalable framework for high-dimensional approximation and co-optimization.
Implement System-Wide, Component-Level Optimization : Recent work has begun to explore broader ML-based design and optimization strategies. MLNoC [64] explores a wide SoC feature space for NoC design optimization. Core and uncore DVFS are combined in Machine Learned Machines [86], along with LLC dynamic cache partitioning, to explore co-optimization potential at run-time. Related DNN accelerator research [114] proposed co-optimization of hardware-based parameters (e.g., bitwidth) and neural network parameters (e.g., L2 regularization). These works motivate consideration for system-wide, component-level ML application. Existing system-level optimization schemes (e.g., [80], [83], [101]) consider configuration opportunities at just a single and very high level of abstraction (e.g., task allocation or big.LITTLE core configurations). Although these works may include low-level features such as NoC utilization and DRAM bandwidth in their ML models, they do not account for the impact of component-level optimization techniques such as NoC packet routing, cache prefetching, etc. We instead envision an ML-based, system-wide, component-level framework for run-time optimization. In this framework, control decisions would involve a larger hierarchy of both component-level (or lower) features and control options, as well as higher-level decisions, allowing a more comprehensive and precise perspective for run-time optimization.
Advance Automated Design : While fully automated design might be the ultimate objective, increasingly automated design is nevertheless an important milestone for future work. Specifically, as more tasks are automated, there is greater potential to enable a positive-feedback loop between machine learning and architecture, providing immense practical benefits for both fields. There are, of course, a number of intervening challenges that must be solved, each of which represents a substantial area for future work. One challenge involves modeling the hierarchical structure of architectural components. This model would likely benefit from integrating pertinent characteristics across the system stack, from process technology to full-system behavior, thus generating a highly accurate representation of real-world systems. Another research direction could explore methods for machine learning models to identify potential design aspects for improvement. Ideally, this model could explore not just reconfiguration of pre-existing options (as in [115]), but also generate novel configuration options. Integrating these and potentially other capabilities may provide a framework to advance automated design.
CONCLUSION
Machine learning has rapidly become a powerful tool in architecture, with established applicability to design, optimization, simulation, and more. Notably, ML has already been successfully applied to many components, including the core, cache, NoC, and memory, with performance often surpassing prior state-of-the-art analytical, heuristic, and human-expert strategies. Widespread application is further facilitated by diverse training methods and learning models, allowing effective trade-offs between performance and overhead based on task requirements. These advancements are likely just the beginning of a revolutionary shift in architecture.
Optimization opportunities at the model level involving pruning and quantization offer broad benefits by enabling more practical implementation. Similarly, opportunities abound to extend existing work using ever-more-powerful ML models, enabling finer-granularity, system-wide implementation. Finally, ML may be applied to entirely new aspects of architecture, learning hierarchical or abstract representations to characterize full-system behavior based on both high- and low-level details. These extensive opportunities, along with yet to be envisioned possibilities, may eventually close the loop on highly (or even fully) automated architectural design.
REFERENCES
[1] S. Kotsiantis, "Supervised machine learning: A review of classification techniques," in
Proceedings of the 2007 Conference on Emerg-ing Artificial Intelligence Applications in Computer Engineering: RealWorld AI Systems with Applications in eHealth, HCI, InformationRetrieval and Pervasive Technologies , pp. 3–24, 2007.[2] N. Mishra, H. Zhang, J. D. Lafferty, and H. Hoffman, “A proba-bilistic graphical model-based approach for minimizing energyunder performance constraints,” in
International Conference onArchitectural Support for Programming Languages and OperatingSystems (ASPLOS) , Mar. 2015.[3] V. N. Vapnik, “An overview of statistical learning theory,”
IEEETransactions on Neural Networks , vol. 10, Sep. 1999.[4] J. Shlens, “A tutorial on principal component analysis,” 2014.arXiv:1404.1100.[5] M. Alawieh, F. Wang, and X. Li, “Efficient hierarchical perfor-mance modeling for integrated circuits via bayesian co-learning,”in
Design Automation Conference (DAC) , June 2017.[6] R. S. Sutton and A. G. Barto,
Reinforcement Learning: An Introduc-tion . Cambridge, USA: MIT Press, 2nd ed., 1998.[7] I. Guyon and A. Elisseeff, “An introduction to variable andfeature selection,”
The Journal of Machine Learning Research , vol. 3,pp. 1157–1182, Mar. 2003.[8] J. Li, K. Chen, S. Wang, F. Morstatter, R. P. Trevino, J. Tang, andH. Liu, “Feature selection: A data perspective,”
ACM ComputingSurveys , vol. 50, Jan. 2018.[9] E. Ipek, S. A. McKee, B. R. de Supinski, M. Schulz, and R. Caru-ana, “Efficiently exploring architectural design spaces via predic-tive modeling,” in
International Conference on Architectural Supportfor Programming Languages and Operating Systems (ASPLOS) , Oct.2006.[10] B. Ozisikyilmaz, G. Memik, and A. Choudhary, “Machine learn-ing models to predict performance of computer system designalternatives,” in
International Conference on Parallel Processing(ICPP) , Sept. 2008. [11] S. Eyerman, K. Hoste, and L. Eeckhout, “Mechanistic-empiricalprocessor performance modeling for constructing cpi stacks onreal hardware,” in
International Symposium on Performance Analy-sis of Systems and Software (ISPASS) , Apr. 2011.[12] X. Zheng, P. Ravikumar, L. K. John, and A. Gerstlauer, “Learning-based analytical cross-platform performance prediction,” in
Inter-national Conference on Embedded Computer Systems: Architectures,Modeling, and Simulation (SAMOS) , July 2015.[13] X. Zheng, L. K. John, and A. Gerstlauer, “Accurate phase-levelcross-platform power and performance estimation,” in
DesignAutomation Conference (DAC) , June 2016.[14] N. Agarwal, T. Jain, and M. Zahran, “Performance predictionfor multi-threaded applications,” in
International Workshop on AI-assisted Design for Architecture (AIDArc), held in conjunction withISCA , June 2019.[15] W. Jia, K. A. Shaw, and M. Martonosi, “Stargazer: Automatedregression-based gpu design space exploration,” in
InternationalSymposium on Performance Analysis of Systems and Software (IS-PASS) , Apr. 2012.[16] G. Wu, J. L. Greathouse, A. Lyashevsky, N. Jayasena, andD. Chiou, “Gpgpu performance and power estimation using ma-chine learning,” in
International Symposium on High-PerformanceComputer Architecture (HPCA) , Feb. 2015.[17] A. Jooya, N. Dimopoulos, and A. Baniasadi, “Multiobjective gpudesign space exploration optimization,” in
International Confer-ence on High Performance Computing & Simulation (HPCS) , July2016.[18] T.-R. Lin, Y. Li, M. Pedram, and L. Chen, “Design space explo-ration of memory controller placement in throughput processorswith deep learning,” in
IEEE Computer Architecture Letters , vol. 18,Mar. 2019.[19] I. Baldini, S. J. Fink, and E. Altman, “Predicting gpu performancefrom cpu runs using machine learning,” in
International Sympo-sium on Computer Architecture and High Performance Computing(SBAC-PAD) , Oct. 2014.[20] N. Ardalani, C. Lestourgeon, K. Sankaralingam, and X. Zhu,“Cross-architecture performance prediction (xapp) using cpucode to predict gpu performance,” in
International Symposium onMicroarchitecture (MICRO) , June 2015.[21] N. Ardalani, U. Thakker, A. Albarghouthi, and K. Sankaralingam,“A static analysis-based cross-architecture performance predic-tion using machine learning,” in
International Workshop on AI-assisted Design for Architecture (AIDArc), held in conjunction withISCA , June 2019.[22] K. O’Neal, P. Brisk, E. Shriver, and M. Kishinevsky, “Hal-wpe: Hardware-assisted light weight performance estimation forgpus,” in
Design Automation Conference (DAC) , June 2017.[23] Y. Li, D. Penney, A. Ramamurthy, and L. Chen, “Characterizingon-chip traffic patterns in general-purpose gpus: A deep learningapproach,” in
International Conference on Computer Design (ICCD) ,Nov. 2019.[24] A. Pattnaik, X. Tang, A. Jog, O. Kayran, A. K. Mishra, M. T.Kandemir, O. Mutlu, and C. R. Das, “Scheduling techniquesfor gpu architectures with processing-in-memory capabilities,”in
International Conference on Parallel Architectures and CompilationTechniques (PACT) , Sept. 2016.[25] L. Peled, S. Mannor, U. Weiser, and Y. Etsion, “Semantic localityand context-based prefetching using reinforcement learning,” in
International Symposium on High-Performance Computer Architecture(HPCA) , June 2015.[26] Y. Zeng and X. Guo, “Long short term memory based hardwareprefetcher,” in
International Symposium on Memory Systems (Mem-Sys) , Oct. 2017.[27] P. Braun and H. Litz, “Understanding memory access patternsfor prefetching,” in
International Workshop on AI-assisted Designfor Architecture (AIDArc), held in conjunction with ISCA , June 2019.[28] E. Bhatia, G. Chacon, S. Pugsley, E. Teran, P. V. Gratz, and D. A.Jim´enez, “Perceptron-based prefetch filtering,” in
InternationalSymposium on Computer Architecture (ISCA) , June 2019.[29] E. Teran, Z. Wang, and D. A. Jim´enez, “Perceptron learning forreuse prediction,” in
International Symposium on Microarchitecture(MICRO) , Oct. 2016.[30] H. Wang, X. Yi, P. Huang, B. Cheng, and K. Zhou, “Efficient ssdcaching by avoiding unnecessary writes using machine learn-ing,” in
International Conference on Parallel Processing (ICPP) , Aug.2018.[31] A. Margaritov, D. Ustiugov, E. Bugnion, and B. Grot, “Virtualaddress translation via learned page tables indexes,” in
Conferenceon Neural Information Processing Systems (NeurIPS) , Dec. 2018.[32] T. Kraska, A. Beutel, E. H. Chi, J. Dean, and N. Polyzotis, “Thecase for learned index structures,” in
International Conference onManagement of Data (SIGMOD) , June 2018. [33] E. Ipek, O. Mutlu, J. F. Martinez, and R. Caruana, “Self-optimizing memory controllers: A reinforcement learning ap-proach,” in International Symposium on High-Performance ComputerArchitecture (HPCA) , June 2008.[34] J. Mukundan and J. F. Martinez, “Morse: Multi-objective re-configurable self-optimizing memory scheduler,” in
InternationalSymposium on High-Performance Computer Architecture (HPCA) ,Feb. 2012.[35] S. Manoj, H. Yu, H. Huang, and D. Xu, “A q-learning basedself-adaptive i/o communication for 2.5d integrated many-coremicroprocessor and memory,”
IEEE Transactions on Computers ,vol. 65, June 2015.[36] S. Wang and E. Ipek, “Reducing data movement energy viaonline data clustering and encoding,” in
International Symposiumon Microarchitecture (MICRO) , Oct. 2016.[37] W. Kang and S. Yoo, “Dynamic management of key states forreinforcement learning-assisted garbage collection to reduce longtail latency in ssd,” in
Design Automation Conference (DAC) , June2018.[38] Z. Deng, L. Zhang, N. Mishra, H. Hoffman, and F. T. Chong,“Memory cocktail therapy: A general learning-based frameworkto optimize dynamic tradeoffs in nvms,” in
International Sympo-sium on Microarchitecture (MICRO) , Oct. 2017.[39] J. Xiao, Z. Xiong, S. Wu, Y. Yi, H. Jin, and K. Hu, “Disk failureprediction in data centers via online learning,” in
InternationalConference on Parallel Processing (ICPP) , June 2018.[40] D. A. Jim´enez and C. Lin, “Dynamic branch prediction with per-ceptrons,” in
International Symposium on High-Performance Com-puter Architecture (HPCA) , Jan. 2001.[41] R. S. Amant, D. A. Jim´enez, and D. Burger, “Low-power, high-performance analog neural branch prediction,” in
InternationalSymposium on Microarchitecture (MICRO) , Nov. 2008.[42] D. A. Jim´enez, “An optimized scaled neural branch predictor,” in
International Conference on Computer Design (ICCD) , Oct. 2011.[43] E. Garza, S. Mirbagher-Ajorpaz, T. A. Khan, and D. A. Jim´enez,“Bit-level perceptron prediction for indirect branches,” in
Inter-national Symposium on Computer Architecture (ISCA) , June 2019.[44] A. Seznec, “Tage-sc-l branch predictors again,” in , 2016.[45] S. J. Tarsa, C.-K. Lin, G. Keskin, G. Chinya, and H. Wang,“Improving branch prediction by modeling global history withconvolutional neural networks,” in
International Workshop on AI-assisted Design for Architecture (AIDArc), held in conjunction withISCA , June 2019.[46] A. G. Savva, T. Theocharides, and V. Soteriou, “Intelligent on/offdynamic link management for on-chip networks,” in
Journal ofElectrical and Computer Engineering - Special issue on Networks-on-Chip: Architectures, Design Methodologies, and Case Studies , Jan2012.[47] D. DiTomaso, A. Sikder, A. Kodi, and A. Louri, “Machine learn-ing enabled power-aware network-on-chip design,” in
Design,Automation and Test in Europe (DATE) , Mar. 2017.[48] S. V. Winkle, A. Kodi, R. Bunescu, and A. Louri, “Extendingthe power-efficiency and performance of photonic interconnectsfor heterogeneous multicores with machine learning,” in
In-ternational Symposium on High-Performance Computer Architecture(HPCA) , Feb. 2018.[49] M. F. Reza, T. T. Le, B. De, M. Bayoumi, and D. Zhao, “Neuro-noc: Energy optimization in heterogeneous many-core noc usingneural networks in dark silicon era,” in
International Symposiumon Circuits and Systems (ISCAS) , May 2018.[50] M. Clark, A. Kodi, R. Bunescu, and A. Louri, “Lead: Learning-enabled energy-aware dynamic voltage/frequency scaling innocs,” in
Design Automation Conference (DAC) , June 2018.[51] Q. Fettes, M. Clark, R. Bunescu, A. Karanth, and A. Louri,“Dynamic voltage and frequency scaling in nocs with supervisedand reinforcement learning techniques,”
IEEE Transactions onComputers , vol. 68, Mar. 2019.[52] J. A. Boyan and M. L. Littman, “Packet routing in dynami-cally changing networks: a reinforcement learning approach,”
Advances in Neural Information Processing Systems , vol. 6, pp. 671–678, 1994.[53] M. Majer, C. Bobda, A. Ahmadinia, and J. Teich, “Packet routingin dynamically changing networks on chip,” in
InternationalParallel and Distributed Processing Symposium (IPDPS) , Apr. 2005.[54] C. Feng, Z. Lu, A. Jantsch, J. Li, and M. zhang, “A reconfigurablefault-tolerant deflection routing algorithm based on reinforce-ment learning for network-on-chip,” in
International Workshop onNetwork on Chip Architectures (NoCArc), held in conjunction withMICRO , Dec. 2010.[55] M. Ebrahimi, M. Daneshtalab, and F. Farahnakian, “Haraq:Congestion-aware learning model for highly adaptive routing algorithm in on-chip networks,” in
International Symposium onNetworks-on-Chip (NOCS) , June 2012.[56] B. K. Daya, L.-S. Peh, and A. P. Chandrakasan, “Quest for high-performance bufferless nocs with single-cycle express paths andself-learning throttling,” in
Design Automation Conference (DAC) ,June 2016.[57] B. Wang, Z. Lu, and S. Chen, “Ann based admission control foron-chip networks,” in
Design Automation Conference (DAC) , June2019.[58] V. Soteriou, T. Theocharides, and E. Kakoulli, “A holistic ap-proach towards intelligent hotspot prevention in network-on-chip-based multicores,”
IEEE Transactions on Computers , vol. 65,May 2015.[59] J. Yin, Y. Eckert, S. Che, M. Oskin, and G. H. Loh, “Towardmore efficient noc arbitration: A deep reinforcement learningapproach,” in
International Workshop on AI-assisted Design forArchitecture (AIDArc), held in conjunction with ISCA , June 2018.[60] S. Das, J. R. Doppa, D. H. Kim, P. P. Pande, and K. Chakrabarty,“Optimizing 3d noc design for energy efficiency: A machinelearning approach,” in
International Conference on Computer-AidedDesign (ICCAD) , Nov. 2015.[61] S. Das, J. R. Doppa, P. P. Pande, and K. Chakrabarty, “Energy-efficient and reliable 3d network-on-chip (noc): Architectures andoptimization algorithms,” in
International Conference on Computer-Aided Design (ICCAD) , Nov. 2016.[62] B. K. Joardar, R. G. Kim, J. R. Doppa, P. P. Pande, D. Marculescu,and R. Marculescu, “Learning-based application-agnostic 3d nocdesign for heterogeneous manycore systems,”
IEEE Transactionson Computers , vol. 68, June 2019.[63] T.-R. Lin, D. Penney, M. Pedram, and L. Chen, “Optimizingrouterless network-on-chip designs:an innovative learning-basedframework,” May 2019. arXiv:1905.04423.[64] N. Rao, A. Ramachandran, and A. Shah, “Mlnoc: A machinelearning based approach to noc design,” in
International Sym-posium on Computer Architecture and High Performance Computing(SBAC-PAD) , Sept. 2018.[65] S. Bandyopadhyay, S. Saha, U. Maulik, and K. Deb, “A simulatedannealing-based multiobjective optimization algorithm: Amosa,”
IEEE Transactions on Evolutionary Computation , vol. 12, May 2008.[66] Z. Qian, D.-C. Juan, P. Bogdan, C.-Y. Tsui, D. Marculescu,and R. Marculescu, “Svr-noc: A performance analysis tool fornetwork-on-chips using learning-based support vector regressionmodel,” in
Design, Automation and Test in Europe (DATE) , Mar.2013.[67] K. Sangaiah, M. Hempstead, and B. Taskin, “Uncore rpd: Rapiddesign space exploration of the uncore via regression modeling,”in
International Conference on Computer-Aided Design (ICCAD) ,Nov. 2015.[68] D. DiTomaso, T. Boraten, A. Kodi, and A. Louri, “Dynamic errormitigation in nocs using intelligent prediction techniques,” in
International Symposium on Microarchitecture (MICRO) , Oct. 2016.[69] K. Wang, A. Louri, A. Karanth, and R. Bunescu, “High-performance, energy-efficient, fault-tolerant network-on-chip de-sign using reinforcement learning,” in
Design, Automation and Testin Europe (DATE) , Mar. 2019.[70] K. Wang, A. Louri, A. Karanth, and R. Bunescu, “Intellinoc: Aholistic design framework for energy-efficient and reliable on-chip communication for manycores,” in
International Symposiumon Computer Architecture (ISCA) , June 2019.[71] J.-Y. Won, X. Chen, P. Gratz, J. Hu, and V. Soteriou, “Up by theirbootstraps: Online learning in artificial neural networks for cmpuncore power management,” in
International Symposium on High-Performance Computer Architecture (HPCA) , Feb. 2014.[72] G.-Y. Pan, J.-Y. Jou, and B.-C. Lai, “Scalable power managementusing multilevel reinforcement learning for multiprocessors,” in
ACM Transactions on Design Automation of Electronic Systems , Aug.2014.[73] P. E. Bailey, D. K. Lowenthal, V. Ravi, B. Rountree, M. Schulz, andB. R. de Supinski, “Adaptive configuration selection for power-constrained heterogeneous systems,” in
International Conferenceon Parallel Processing (ICPP) , Sept. 2014.[74] D. Lo, T. Song, and G. E. Suh, “Prediction-guided performance-energy trade-off for interactive applications,” in
InternationalSymposium on Microarchitecture (MICRO) , Dec. 2015.[75] N. Mishra, J. D. Lafferty, and H. Hoffman, “Caloree: Learningcontrol for predictable latency and low energy,” in
InternationalConference on Architectural Support for Programming Languages andOperating Systems (ASPLOS) , Mar. 2018.[76] Y. Bai, V. W. Lee, and E. Ipek, “Voltage regulator efficiency awarepower management,” in
International Conference on ArchitecturalSupport for Programming Languages and Operating Systems (ASP-LOS) , Apr. 2017. [77] M. Allen and P. Fritzsche, “Reinforcement learning with adaptivekanerva coding for xpilot game ai,” in IEEE Congress of Evolution-ary Computation , June 2011.[78] Z. Chen and D. Marculescu, “Distributed reinforcement learningfor power limited many-core system performance optimization,”in
Design, Automation and Test in Europe (DATE) , Mar. 2015.[79] Z. Chen, D. Stamoulis, and D. Marculescu, “Profit: Priority andpower/performance optimization for many-core systems,”
IEEETransactions on Computer-Aided Design of Integrated Circuits andSystems , vol. 37, pp. 2064–2075, Oct. 2018.[80] C. Imes, S. Hofmeyr, and H. Hoffman, “Energy-efficient applica-tion resource scheduling using machine learning classifiers,” in
International Conference on Parallel Processing (ICPP) , Aug. 2018.[81] S. J. Tarsa, R. B. R. Chowdhury, J. Sebot, G. Chinya, J. Gaur,K. Sankaranarayanan, C.-K. Lin, R. Chappell, R. Singhal, andH. Wang, “Post-silicon cpu adaptation made practical usingmachine learning,” in
International Symposium on Computer Ar-chitecture (ISCA) , June 2019.[82] S. J. Lu, R. Tessier, and W. Burleson, “Reinforcement learning forthermal-aware many-core task allocation,” in
Proceedings of the25th edition on Great Lakes Symposium on VLSI , May 2015.[83] D. Nemirovsky, T. Arkose, N. Markovic, M. Nemirovsky, O. Un-sal, and A. Cristal, “A machine learning approach for perfor-mance prediction and scheduling on heterogeneous cpus,” in
International Symposium on Computer Architecture and High Per-formance Computing (SBAC-PAD) , Oct. 2017.[84] H. Zhang, B. Tang, X. Geng, and H. Ma, “Learning drivenparallelization for large-scale video workload in hybrid cpu-gpucluster,” in
International Conference on Parallel Processing (ICPP) ,Aug. 2018.[85] R. Bitirgen, E. Ipek, and J. F. Martinez, “Coordinated manage-ment of multiple interacting resources in chip multiprocessors:A machine learning approach,” in
International Symposium onMicroarchitecture (MICRO) , Nov. 2008.[86] R. Jain, P. R. Panda, and S. Subramoney, “Machine learnedmachines: Adaptive co-optimization of caches, cores, and on-chipnetwork,” in
Design, Automation and Test in Europe (DATE) , Mar.2016.[87] Y. Ding, N. Mishra, and H. Hoffmann, “Generative and multi-phase learning for computer systems optimization,” in
Interna-tional Symposium on Computer Architecture (ISCA) , June 2019.[88] G. Wu, Y. Xu, D. Wu, M. Ragupathy, Y. yen Mo, and C. Chu,“Flip-flop clustering by weighted k-means algorithm,” in
DesignAutomation Conference (DAC) , June 2016.[89] M. Ozsoy, K. N. Khasawneh, C. Donovick, I. Gorelik, N. Abu-Ghazaleh, and D. Ponomarev, “Hardware-based malware detec-tion using low-level architectural features,”
IEEE Transactions onComputers , vol. 65, Mar. 2016.[90] H. Esmaeilzadeh, A. Sampson, L. Ceze, and D. Burger, “Neuralacceleration for general-purpose approximate programs,” in
In-ternational Symposium on Microarchitecture (MICRO) , Dec. 2012.[91] A. Yazdanbakhsh, J. Park, H. Sharma, P. Lotfi-Kamran, andH. Esmaeilzadeh, “Neural acceleration for gpu throughput pro-cessors,” in
International Symposium on Microarchitecture (MICRO) ,Dec. 2015.[92] B. Grigorian, N. Farahpour, and G. Reinman, “Brainiac: Bringingreliable accuracy into neurally-implemented approximate com-puting,” in
International Symposium on High-Performance ComputerArchitecture (HPCA) , Feb. 2015.[93] D. Mahajan, A. Yazdanbaksh, J. Park, B. Thwaites, and H. Es-maeilzadeh, “Towards statistical guarantees in controlling qual-ity tradeoffs for approximate acceleration,” in
International Sym-posium on Computer Architecture (ISCA) , June 2016.[94] G. F. Oliveira, L. R. Goncalves, M. Brandalero, A. C. S. Beck,and L. Carro, “Employing classification-based algorithms forgeneral-purpose approximate computing,” in
Design AutomationConference (DAC) , June 2018.[95] F. N. Taher, J. Callenes-Sloan, and B. C. Schafer, “A machinelearning based hard fault recuperation model for approximatehardware accelerators,” in
Design Automation Conference (DAC) ,June 2018.[96] X. Chen, Z. Xu, H. Kim, P. Gratz, J. Hu, M. Kishinevsky, andU. Ogras, “In-network monitoring and control policy for dvfsof cmp networks-on-chip and last level caches,” in
InternationalSymposium on Networks-on-Chip (NOCS) , May 2012.[97] R. Sutton, “Generalization in reinforcement learning: Successfulexamples using sparse coarse coding,” in
Conference on NeuralInformation Processing Systems (NeurIPS) , June 1996.[98] J. A. Boyan and A. W. Moore, “Learning evaluation functionsto improve optimization by local search,”
The Journal of MachineLearning Research , Sep. 2001.[99] P. Bratley and B. L. Fox, “Algorithm 659: Implementing sobol’squasirandom sequence generator,”
ACM Transactions on Mathe-matical Software , vol. 14, Mar. 1988. [100] T. Sherwood, E. Perelman, and B. Calder, “Basic block distribu-tion analysis to find periodic behavior and simulation points inapplications,” in
International Conference on Parallel Architecturesand Compilation Techniques (PACT) , Sept. 2001.[101] W. Wang, J. W. Davidson, and M. L. Soffa, “Predicting the mem-ory bandwidth and optimal core allocations for multi-threadedapplications on large-scale numa machines,” in
International Sym-posium on High-Performance Computer Architecture (HPCA) , Mar.2016.[102] R. M. Kretchmar, “Reinforcement learning algorithms for ho-mogenous multi-agent systems,” in
Workshop on Agent and SwarmProgramming , 2003.[103] T. D. Kulkarni, K. R. Narasimhan, A. Saeedi, and J. B. Tenenbaum,“Hierarchical deep reinforcement learning: Integrating temporalabstraction and intrinsic motivation,” in
Conference on NeuralInformation Processing Systems (NeurIPS) , Dec. 2016.[104] H. Mao, Z. Gong, Z. Zhang, Z. Xiao, and Y. Ni, “Learning multi-agent communication under limited-bandwidth restriction forinternet packet routing,” Feb. 2019. arXiv:1903.05561.[105] V. Sze, Y.-H. Chen, T.-J. Yang, and J. Emer, “Efficient processingof deep neural networks: A tutorial and survey,” Aug. 2017.arXiv:1703.09039.[106] S. Han, J. Pool, J. Tran, and W. J. Dally, “Learning bothweights and connections for efficient neural networks,” Oct. 2015.arXiv:1506.02626.[107] D. C. Mocanu, E. Mocanu, P. Stone, P. H. Nguyen, M. Gibescu,and A. Liotta, “Scalable training of artificial neural networkswith adaptive sparse connectivity inspired by network science,”
Nature Communications , vol. 9, June 2018.[108] M. Courbariaux, I. Hubara, D. Soudry, R. El-Yaniv, and Y. Bengio,“Binarized neural networks: Training deep neural networks withweights and activations constrained to +1 or -1,” Mar. 2016.arXiv:1602.02830.[109] M. S. Razlighi, M. Imani, F. Koushanfar, and T. Rosing, “Looknn:Neural network with no multiplication,” in
Design, Automationand Test in Europe (DATE) , Mar. 2017.[110] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion,O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg,J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot,and E. Duchesnay, “Scikit-learn: Machine learning in Python,”
Journal of Machine Learning Research , vol. 12, pp. 2825–2830, 2011.[111] R. Boyapati, J. Huang, P. Majumder, K. H. Yum, and E. J. Kim,“Approx-noc: A data approximation framework for network-on-chip architectures,” in
International Symposium on ComputerArchitecture (ISCA) , June 2017.[112] A. Raha and V. Raghunathan, “Towards full-system energy-accuracy tradeoffs: A case study of an approximate smart camerasystem,” in
Design Automation Conference (DAC) , June 2017.[113] A. Sampson, A. Baixo, B. Ransford, T. Moreau, J. Yip, L. Ceze, andM. Oskin, “Accept: A programmer-guided compiler frameworkfor practical approximate computing,”
University of WashingtonTechnical Report , vol. 1, Jan. 2015.[114] B. Reagen, J. M. Hern´andez-Lobato, R. Adolf, M. Gelbart, P. Waht-moug, G.-Y. Wei, and D. Brooks, “A case for efficient acceleratordesign space exploration via bayesian optimization,” in
Interna-tional Symposium on Low Power Electronics and Design (ISLPED) ,July 2017.[115] A. Vallero, A. Savino, G. Politano, S. D. Carlo, A. Chatzidimitriou,S. Tselonis, M. Kaliorakis, D. Gizopoulos, M. Riera, R. Canal,A. Gonzalez, M. Kooli, A. Bosio, and G. D. Natale, “Cross-layersystem reliability assessment framework for hardware faults,” in