Voltage Scaling for Partitioned Systolic Array in A Reconfigurable Platform
Rourab Paul, Sreetama Sarkar, Suman Sau, Koushik Chakraborty, Sanghamitra Roy, Amlan Chakrabarti
Computer Science & Engineering, Siksha 'O' Anusandhan, Odisha, India
Electrical and Computer Engineering, Technical University Munich, Germany
Computer Science & Information Technology, Siksha 'O' Anusandhan, Odisha, India
Dept. Electrical and Computer Engineering, Utah State University, Logan, USA
School of IT, University of Calcutta, Kolkata, India

Abstract: The exponential emergence of Field Programmable Gate Arrays (FPGAs) has accelerated research on hardware implementations of Deep Neural Networks (DNNs). Among all DNN processors, domain-specific architectures such as Google's Tensor Processor Unit (TPU) have outperformed conventional GPUs. However, an implementation of a TPU in reconfigurable hardware should emphasize energy savings to serve the green computing requirement. Voltage scaling, a popular approach towards energy savings, is risky in an FPGA because it may cause timing failure if it is not done appropriately. In this work, we present an ultra-low-power FPGA implementation of a TPU for edge applications. We divide the systolic array of a TPU into different FPGA partitions, where each partition uses a different near-threshold (NTC) biasing voltage to run its FPGA cores. The biasing voltage for each partition is roughly calculated by the proposed offline scheme and is then calibrated further by the proposed online scheme. We study the partitioning of the FPGA with four clustering algorithms that group design paths by their slack values. To overcome the timing failure caused by NTC, the higher-slack paths are placed in lower-voltage partitions and the lower-slack paths are placed in higher-voltage partitions. The proposed architecture is simulated on an Artix-7 FPGA using the Vivado design suite and a Python tool. The simulation results substantiate the implementation of a voltage-scaled TPU in FPGAs and also justify its power efficiency.

Index Terms: FPGA partition, Low Power, TPU, Voltage Scaling
I. INTRODUCTION

The configurable logic blocks (CLBs) and switch matrices of FPGAs are power hungry, which makes FPGAs energy inefficient compared to ASICs. Recently, many researchers [1], [2] have reported CPU-FPGA based hybrid data center architectures that provide hardware acceleration for DNNs. Despite its power inefficiency, the FPGA has become popular in cloud-scale acceleration architectures due to its specialized hardware and the economic benefits of homogeneity. Therefore, reducing power in FPGAs for DNN applications is a very relevant research topic. Article [3] studied timing failure versus biasing voltage for DNN implementations in FPGAs. Its authors underscaled the biasing voltage $V_{ccint}$ of the entire FPGA to increase the power efficiency of a Convolutional Neural Network (CNN) accelerator by a factor of 3. A single $V_{ccint}$ for the entire FPGA might not be the most power-efficient solution. Partitioning an FPGA according to the slacks and feeding different biasing voltages to different partitions can reduce the power of CNN implementations further. In [4], the authors implemented a systolic array using a near-threshold (NTC) biasing voltage in an ASIC, which can predict the timing failure of the multiplier-accumulators (MACs) placed inside the systolic array of a TPU. The prediction of timing failure is based on the Razor flip-flop [5]. Higher fluctuation of the input bits increases the possibility of timing failure in the NTC condition. In [4], once the timing failure of a MAC is predicted by its internal Razor flip-flop, the biasing voltage of the MAC is boosted.

Targeting FPGA-based DNN applications [1], our work investigates voltage scaling techniques for a TPU on the FPGA platform. A different $V_{ccint}$ for each MAC in a systolic array would be an impractical implementation for an FPGA; therefore, this work partitions the FPGA floor according to the slack values of the paths of the MACs. Each partition consists of a group of paths within the MACs that have similar slacks. Each partition is connected to a different $V_{ccint}$. The proposed methodology extracts the synthesis timing report from the Vivado tool. In a synthesized design, the Vivado IDE timing engine estimates the net delays of paths based on connectivity and fanout. The clustering algorithms create clusters, or groups, based on path delays. The clusters with higher delays, and hence lower slack, are placed in FPGA partitions with higher $V_{ccint}$, and the clusters with lower delays, and hence higher slack, are placed in FPGA partitions with lower $V_{ccint}$. Here $V_{ccint}$ provides power to an FPGA core. The tuning of $V_{ccint}$ with slack is done by a unique offline-online strategy. The circuit-level challenges of implementing voltage scaling on the FPGA platform are beyond the present scope of our article. However, the feasibility of implementing the necessary hardware for voltage scaling support is evident considering the successful implementations in ASIC technologies. As this support is unavailable in current FPGAs, we have simulated the design to validate the claim. The contributions of the paper are as follows:

• This paper proposes a new CAD flow to create a voltage-scaled TPU on FPGA-based platforms, considering the trade-off of circuit delay against biasing voltage.
• The proposed algorithm divides the systolic array of the TPU into different partitions. Each partition has a different $V_{ccint}$. The $V_{ccint}$ in different partitions is scaled against the delay of different slacks.
• The calibration of $V_{ccint}$ of the different partitions is done by the proposed online and offline schemes.

The organization of the article is as follows: Sec. II outlines our proposed EDA tool flow. Sec. III discusses the clustering algorithms. The methodology of the proposed work is described in Sec. IV. Implementation and results, and the conclusion, are organized in Sec. V and Sec. VI respectively.

II. TOOL FLOW
A typical Xilinx FPGA design flow in the Vivado environment has 3 conventional steps: synthesis, implementation and bit file generation. The adopted tool flow of the proposed partitioned FPGA is divided into two environments: (i) the Vivado environment for synthesis, implementation and bit file generation and (ii) the Python environment for clustering similar slacks. The entire tool flow is shown in Fig. 1.
A. Vivado Environment
The Vivado environment involves the 3 sub-steps stated below:
1) Synthesis:
The Vivado synthesis process transforms the register transfer level (RTL) description into a gate-level representation. The synthesis process generates the delays of all possible paths of the design. The timing report of the synthesis process contains 12 fields: name of the path, slack value, levels, high fanout, path from, path to, total delay of the path, logic delay, net delay, time requirement, source clock and destination clock, as shown in Table I. Note that the slack of each logic block is only a high-level estimate at this stage. The actual timing behavior of the design depends on the net delays after placement and routing.
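As an illustration of how the 12 fields of a timing-report row can be consumed on the Python side, the following minimal sketch parses one row of Table I into a record. The whitespace-separated row layout and the names `TimingPath` and `parse_timing_row` are assumptions made for this example; they mirror the table fragment, not an actual Vivado export format.

```python
from dataclasses import dataclass


@dataclass
class TimingPath:
    """One row of the synthesis timing report (the 12 fields of Table I)."""
    name: str
    slack: float
    levels: int
    high_fanout: int
    src: str
    dst: str
    total_delay: float
    logic_delay: float
    net_delay: float
    requirement: float
    source_clock: str
    destination_clock: str


def parse_timing_row(row: str) -> TimingPath:
    """Parse a whitespace-separated row (assumed layout, see lead-in)."""
    tokens = row.split()
    name = " ".join(tokens[:2])  # the path name "Path 1" spans two tokens
    (slack, levels, fanout, src, dst,
     total, logic, net, req, sclk, dclk) = tokens[2:13]
    return TimingPath(name, float(slack), int(levels), int(fanout), src, dst,
                      float(total), float(logic), float(net), float(req),
                      sclk, dclk)


row = ("Path 1 5.34 8 8 "
       "GEN_REG_I[0].GEN_REG_J[1].uut/prev_activ_reg[1]/C "
       "GEN_REG_I[1].GEN_REG_J[1].uut/sig_mac_out_reg[16]/D "
       "4.37 2.80 1.57 10.00 clk clk")
path = parse_timing_row(row)
print(path.name, path.slack)
```

Only the slack field is needed by the clustering step; the remaining fields are kept so that the record stays a faithful copy of the report row.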
2) Implementation:
The Vivado implementation process is a timing-driven flow that transforms a logical netlist and constraints (in Xilinx Design Constraints format) into a placed and routed design that is ready for the bitstream generation process. In our proposed tool flow, the logical netlist is provided by the Vivado synthesis process, but the Xilinx Design Constraints (XDC) file is generated by a Python script. The clustered slack values are considered when placing the logic paths at specific locations on the FPGA floor.
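One way a Python script could emit such placement constraints is through Vivado pblocks. The sketch below is only an assumption about how the XDC text might be produced: the paper does not state which XDC commands its script writes, and the pblock name, cell name and SLICE range used here are hypothetical.

```python
def pblock_constraints(name, cells, slice_range):
    """Return XDC text that confines `cells` to one rectangular pblock.

    name        -- pblock name (hypothetical, one per FPGA partition)
    cells       -- hierarchical cell names belonging to one slack cluster
    slice_range -- site range such as "SLICE_X0Y0:SLICE_X19Y49" (assumed)
    """
    lines = [
        f"create_pblock {name}",
        f"resize_pblock [get_pblocks {name}] -add {{{slice_range}}}",
    ]
    for cell in cells:
        # Braces protect the square brackets in generated instance names.
        lines.append(
            f"add_cells_to_pblock [get_pblocks {name}] [get_cells {{{cell}}}]"
        )
    return "\n".join(lines)


xdc_text = pblock_constraints(
    "pblock_part1",
    ["GEN_REG_I[0].GEN_REG_J[1].uut"],
    "SLICE_X0Y0:SLICE_X19Y49",
)
print(xdc_text)
```

Calling the function once per cluster and concatenating the results would give one constraint file for the whole systolic array.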
3) Bit File Generation:
Once placement and routing are completed by the implementation process, the flow generates the bitstream of the systolic array. The Xilinx bitstream generation program produces a bitstream for configuring the Xilinx device. If a processor is required, the design may include software application data.
B. Python Environment
The contribution of the paper lies in augmenting the standard FPGA design tool flow with a Python-based environment, which consists of a script that runs three consecutive processes: choice of clustering algorithm, cluster generation and constraint generation.

Fig. 1: Tool Flow (Vivado environment: synthesis, implementation, bit file generation; Python environment: synthesis report, cluster algorithms, clusters report, constraint file generation)

Fig. 2: TPU Architecture
1) Choice of Clustering Algorithms:
A clustering algorithm suited to the requirements is chosen at this step. As stated in Sec. III, this paper investigates 4 commonly used clustering algorithms: Hierarchical, K-Means, Mean-Shift and DBSCAN.
2) Cluster Generation:
We assume that the FPGA is divided into a few partitions and that each partition has a different biasing voltage $V_{ccint}$. The clustering algorithm creates a few groups. Paths with similar slacks form a group and are placed in the same FPGA partition.
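The mapping from clusters to partitions can be sketched as follows, using the rule stated in Sec. I that higher-slack clusters go to lower-voltage partitions. The function name `assign_partitions` and the convention that partition index 0 is the lowest-voltage partition are assumptions made for this example.

```python
from collections import defaultdict


def assign_partitions(slacks, labels):
    """Map each cluster label to a partition index.

    Index 0 is taken to be the lowest-voltage partition (an assumed
    convention), so the cluster with the highest mean slack gets index 0
    and the cluster with the lowest mean slack gets the highest index.
    """
    groups = defaultdict(list)
    for slack, label in zip(slacks, labels):
        groups[label].append(slack)
    mean_slack = {label: sum(v) / len(v) for label, v in groups.items()}
    ordered = sorted(mean_slack, key=mean_slack.get, reverse=True)
    return {label: index for index, label in enumerate(ordered)}


# Two illustrative clusters: label 0 has low slack, label 1 has high slack.
mapping = assign_partitions([5.34, 5.49, 7.90, 8.10], [0, 0, 1, 1])
print(mapping)
```

In this example the low-slack cluster (label 0) lands in the higher-voltage partition, which is the behaviour needed to avoid timing failure on the slower paths.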
3) Constraint Generation:
Xilinx uses a constraint file format (XDC) to specify the coordinates of the different paths of the proposed systolic array. The XDC file is generated by the Python script.

III. CLUSTERING ALGORITHMS
We have investigated 4 clustering algorithms to group the paths having similar slacks. An algorithm can be chosen based on the design requirements: either we set a predefined number of clusters, or we set hyperparameters that determine the number of clusters automatically. Different algorithms work well for different data distributions. Depending on our design requirements, we choose among the following four algorithms:
A. Hierarchical
The hierarchical clustering algorithm [6] initially treats each data point as a single cluster and measures the distance between two clusters with a chosen distance measure (in this case, Euclidean distance). The two clusters that are closest to each other are merged. The process continues until all clusters have been merged into a single cluster (the root of the dendrogram). As shown in Fig. 3, the dendrogram is a tree-like structure used for visualizing the hierarchy of clusters. The number of clusters can be decided from the dendrogram. The hierarchical algorithm is computationally expensive for large datasets, with a time complexity of $O(n^3)$ for the standard agglomerative algorithm, where $n$ is the number of data points. As is evident from the dendrogram, the length of the branch joining the last two clusters is the highest, indicating that they are the most dissimilar, followed by the third and fourth clusters. The result of classifying the slack values into 2, 3 and 4 clusters is illustrated in Fig. 4. Different clusters in Figs. 4, 5, 6 and 7 are indicated using different colours.

Fig. 3: Dendrogram

TABLE I: A Fragment of Timing Report from Synthesis

| Name | Slack | Levels | High Fanout | From | To | Total Delay | Logic Delay | Net Delay | Requirement | Source Clock | Destination Clock |
| Path 1 | 5.34 | 8 | 8 | GEN_REG_I[0].GEN_REG_J[1].uut/prev_activ_reg[1]/C | GEN_REG_I[1].GEN_REG_J[1].uut/sig_mac_out_reg[16]/D | 4.37 | 2.80 | 1.57 | 10.00 | clk | clk |
| Path 2 | 5.49 | 8 | 8 | GEN_REG_I[0].GEN_REG_J[1].uut/prev_activ_reg[1]/C | GEN_REG_I[1].GEN_REG_J[1].uut/sig_mac_out_reg[15]/D | 4.40 | 2.83 | 1.57 | 10.00 | clk | clk |
| Path 3 | 5.52 | 9 | 8 | GEN_REG_I[0].GEN_REG_J[1].uut/prev_activ_reg[1]/C | GEN_REG_I[1].GEN_REG_J[1].uut/sig_mac_out_reg[14]/D | 4.36 | 2.89 | 1.47 | 10.00 | clk | clk |
| Path 4 | 5.59 | 9 | 8 | GEN_REG_I[0].GEN_REG_J[1].uut/prev_activ_reg[1]/C | GEN_REG_I[1].GEN_REG_J[1].uut/sig_mac_out_reg[13]/D | 4.30 | 2.83 | 1.47 | 10.00 | clk | clk |

B. K-Means Clustering
K-Means clustering groups data into a predefined number of clusters ($k$). At the beginning, $k$ cluster centers are randomly initialized [7]. The algorithm computes the distance between each data point and the cluster centers and assigns each data point to the cluster whose center is closest to it. The cluster centers are then recomputed as the mean of the data points belonging to that cluster. The process is repeated for a predefined number of steps or until the cluster centers no longer change significantly. K-Means clustering is simple and fast, and its time complexity is $O(n)$. Fig. 5 illustrates the results of applying the K-Means clustering algorithm to the slack values of the systolic array for 3, 4 and 5 clusters.
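The assign-and-recompute loop described above can be sketched for one-dimensional slack values in plain Python. This is an illustrative re-implementation, not the Scikit-learn routine used in Sec. V; the function name `kmeans_1d`, the seed handling and the sample values are assumptions made for this example.

```python
import random


def kmeans_1d(values, k, iters=100, seed=0):
    """Lloyd-style K-Means on a list of scalar slack values."""
    rng = random.Random(seed)
    centers = rng.sample(values, k)  # random initial centers from the data
    labels = [0] * len(values)
    for _ in range(iters):
        # Assignment step: each value joins the cluster with the nearest center.
        labels = [min(range(k), key=lambda c: abs(v - centers[c]))
                  for v in values]
        # Update step: each center becomes the mean of its members.
        new_centers = []
        for c in range(k):
            members = [v for v, lab in zip(values, labels) if lab == c]
            # An empty cluster keeps its previous center.
            new_centers.append(sum(members) / len(members)
                               if members else centers[c])
        if new_centers == centers:  # converged: centers stopped moving
            break
        centers = new_centers
    return centers, labels


centers, labels = kmeans_1d([0.9, 1.0, 1.1, 4.9, 5.0, 5.1], k=2)
print(sorted(centers), labels)
```

Because the initial centers are random, the label numbers may swap between runs with different seeds, but the grouping of well-separated values stays the same.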
C. Mean-Shift Clustering
Mean-Shift clustering [8] is based on the idea of Kernel Density Estimation (KDE). KDE assumes that the data points are generated from an underlying distribution and tries to estimate that distribution by assigning a kernel to each data point. The most commonly used kernel is the Gaussian, or RBF, kernel. The mean-shift algorithm is designed so that the points iteratively climb the KDE surface and are shifted to the nearest KDE peaks. It starts with a randomly selected point as the center of the RBF kernel. Thereafter, it moves the kernel towards regions of higher density by shifting the center of the kernel to the mean of the points within the window (hence the name mean-shift). This continues until shifting the kernel no longer includes more points. The algorithm does not need the number of clusters to be specified beforehand, but it is computationally expensive compared to K-Means (a time complexity of $O(n \log n)$ in lower dimensions for the sklearn implementation). The selection of the window size, or radius $r$, can be non-trivial and plays a key role in the success of the algorithm. Setting the radius to 0.4 for the slack values of the systolic array yields 4 clusters, as observed in Fig. 6.
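The window-shifting idea can be sketched in one dimension as follows. To keep the example short it uses a flat window of radius $r$ instead of an RBF kernel and starts one window at every data point; both are simplifications chosen for this illustration, and the function name `mean_shift_1d` and the sample values are hypothetical.

```python
def mean_shift_1d(values, radius, tol=1e-6, max_iter=100):
    """Flat-window mean shift on scalar values; returns (peaks, labels)."""
    modes = []
    for start in values:
        center = start
        for _ in range(max_iter):
            # Points currently covered by the window around `center`.
            window = [v for v in values if abs(v - center) <= radius]
            new_center = sum(window) / len(window)
            if abs(new_center - center) < tol:  # window stopped moving
                break
            center = new_center
        modes.append(center)
    # Windows that converged within one radius of each other share a peak.
    peaks = []
    for mode in sorted(modes):
        if not peaks or mode - peaks[-1] > radius:
            peaks.append(mode)
    labels = [min(range(len(peaks)), key=lambda i: abs(v - peaks[i]))
              for v in values]
    return peaks, labels


peaks, labels = mean_shift_1d([5.0, 5.1, 5.2, 7.0, 7.1, 7.2], radius=0.4)
print(peaks, labels)
```

The number of peaks is not passed in; it follows from the radius, which is why the choice of $r$ matters so much for this algorithm.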
D. DBSCAN
The DBSCAN algorithm [9] has two important hyperparameters, based on which it determines the number of clusters: epsilon, the maximum distance between two samples for one to be considered in the neighborhood of the other, and minpoints, the number of samples in a neighborhood for a point to be considered a core point. At each step, a data point that has not been visited before is taken. If there are more data points than minpoints within its epsilon radius, all of these data points are marked as belonging to a cluster; otherwise the first point is marked as noise. For all points in the newly formed cluster, the points within their epsilon neighborhood are checked and labeled as either belonging to a cluster or noise. The process continues until all data points have been labeled. The greatest advantage of DBSCAN is that it can identify outliers as noise, unlike the other algorithms, which place every point into a cluster even if a data point is significantly different from the rest. The time complexity of this algorithm is $O(n)$ for reasonable epsilon. This algorithm is not effective for clusters with varying density, since suitable epsilon and minpoints values differ from cluster to cluster.

IV. OFFLINE-ONLINE SCHEMES
To mitigate the timing failure issue in the critical voltage region, we adopt two sequential schemes: (i) an offline scheme, which performs the FPGA partitioning and a rough $V_{ccint_i}$ estimation depending on the FPGA technology, and (ii) an online scheme, which calibrates a suitable $V_{ccint_i}$ for each FPGA partition using Razor flip-flops. Each FPGA partition consists of a group of MACs. All the groups of MACs together form the systolic array of the TPU. Apart from the systolic array, the TPU has memory to store activation and weight inputs, a PCI interface, control circuitry, etc. The architecture of the TPU is shown in Fig. 2.
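The calibration rule of the online scheme, detailed in Sec. IV-B, raises $V_{ccint_i}$ by one step when any MAC in the partition flags a timing failure and lowers it by one step when none does. A minimal sketch of one such step is given below; the step size, the clamping to a permitted voltage range and the name `calibrate_step` are assumptions made for this example and are not specified in the paper.

```python
def calibrate_step(v_ccint, fail_flags, step, v_low, v_high):
    """Return the next V_ccint of one partition after one calibration step.

    fail_flags   -- one boolean per MAC in the partition (Razor flag outputs)
    step         -- voltage increment in volts (assumed value)
    v_low/v_high -- permitted voltage range (clamping is an assumption)
    """
    if any(fail_flags):
        v_next = v_ccint + step   # at least one MAC failed: boost
    else:
        v_next = v_ccint - step   # no failure observed: save power
    return min(max(v_next, v_low), v_high)


# Trial run on one partition: a failure is seen once, then the array is clean.
v = 0.96
for flags in ([False, True, False], [False, False, False]):
    v = calibrate_step(v, flags, step=0.01, v_low=0.95, v_high=1.00)
print(round(v, 2))
```

Running the same step for every partition during a trial run gives each partition its own calibrated voltage before the actual workload starts.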
A. Offline
The proposed offline scheme works in the Vivado and Python environments. As shown in Fig. 1, synthesis is the first step of the proposed tool flow; it takes a netlist of the configurable logic blocks (CLBs) of the systolic array generated by the Vivado tool. This netlist from the synthesis report is generated after the technology mapping and packing stages and contains the time slacks of all possible paths of the systolic array. The proposed approach considers only nodes along paths because (i) the nodes along a path have data dependencies and should be placed in the same FPGA partition even without considering voltage scaling [10], and (ii) the slack values of the nodes along a path are usually close to each other.

Fig. 4: Hierarchical Cluster of Slack of Systolic Array, panels (a), (b), (c)

Fig. 5: K-Means Cluster of Slack of Systolic Array, panels (a), (b), (c)

Fig. 7: DBSCAN Clustering of Slack of Systolic Array

The second step of the proposed methodology is the choice of the clustering algorithm and the cluster generation. As stated in Sec. III, the four clustering algorithms (Hierarchical, K-Means, Mean-Shift and DBSCAN) create multiple clusters from the paths available in the synthesis report. Even for the same number of clusters, different algorithms classify the data points slightly differently. The primary concern is to identify clusters of nodes along a path, which can share the slack available across that path. Unlike the K-Means algorithm, the Hierarchical, Mean-Shift and DBSCAN algorithms do not need the number of clusters to be specified beforehand. DBSCAN is found to perform the best in this case, as it groups nearby data points together, has a reasonable time complexity and can also identify outliers. Hence, the clustered paths returned by DBSCAN are chosen for the subsequent simulations.

Once the number of clusters is fixed, we need to decide the voltage values of the different FPGA partitions. In Fig. 8, we illustrate 3 voltage regions in an FPGA, which are also supported by the research work in [3]. A voltage below the FPGA crashing voltage $V_{crash}$ causes timing failure, which reduces the DNN accuracy to nearly zero. The region between the minimum voltage $V_{min}$ and the nominal voltage $V_{nom}$ is called the guardband region, where the DNN accuracy is 100% but the power efficiency is the lowest. In the critical region, between $V_{crash}$ and $V_{min}$, the closer the voltage is to $V_{crash}$, the higher the power efficiency and the lower the DNN accuracy. Similarly, if $V_{ccint}$ is closer to $V_{min}$ in the critical region, the power efficiency decreases and the DNN accuracy increases. In our proposed architecture, we assume that the operating voltage range for the systolic array is $V_{crash}$ to $V_{min}$. If the chosen clustering algorithm computes $n$ clusters, we need $n$ partitions in the FPGA. The primary $V_{ccint}$ estimate for each FPGA partition is computed by Algorithm 1.

Fig. 8: Voltage behaviour for $V_{ccint}$ (crash, critical and guardband regions delimited by $V_{crash}$, $V_{min}$ and $V_{nom}$; curves: accuracy of design, power efficiency)

In a Xilinx FPGA, the coordinates of the nets are specified by two parameters $(X_i, Y_j)$. Each FPGA partition has a range of these coordinates. In the third step of the proposed methodology, each clustered path computed by the clustering algorithm is placed in a particular FPGA partition, which is restricted to specific $X_i$, $Y_j$ ranges. This restriction is written into the XDC file during the Generate Constraint File process.
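The primary estimate computed by Algorithm 1 can be sketched in Python as below. The sketch reads the algorithm as splitting the range from $V_{crash}$ to $V_{min}$ into $n$ equal bands of width $V_b$ and assigning each partition the midpoint of its band. The division by 2 and the initial value $V_l = V_{crash}$ are assumptions inferred from the partition voltages reported in Table II, and the function name `estimate_vccint` is hypothetical.

```python
def estimate_vccint(v_min, v_crash, n):
    """Rough per-partition V_ccint estimates, lowest voltage first."""
    v_b = (v_min - v_crash) / n   # width of one voltage band
    v_l = v_crash                 # lower edge of the current band (assumed start)
    estimates = []
    for _ in range(n):
        # Midpoint between the lower edge v_l and the upper edge v_l + v_b.
        estimates.append((v_l + (v_l + v_b)) / 2)
        v_l += v_b
    return estimates


# Values of the guardband experiment in Sec. V: range 0.95 V to 1.00 V, n = 4.
for i, v in enumerate(estimate_vccint(1.00, 0.95, 4)):
    print(f"partition index {i}: {v:.5f} V")
```

With these inputs the band width is 0.0125 V and the four midpoints lie between 0.95 V and 1.00 V, which is consistent with the rounded voltages 0.96 V to 0.99 V listed in Table II.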
Algorithm 1 Voltage Scaling
Require: $V_{ccint}$, $V_{min}$, $V_{crash}$ and $n$
  $V_b = \frac{V_{min} - V_{crash}}{n}$
  $V_l = V_{crash}$
  for $i = 0$ to $n-1$ do
    $V_{ccint_i} = \frac{V_l + V_l + V_b}{2}$
    $V_l = V_l + V_b$
  end for

B. Online
The $V_{ccint_i}$ of the $i$-th FPGA partition calculated by Algorithm 1 is supplied to the $V_{ccint_i}$ pin of the $i$-th FPGA partition. The calculation of $V_{ccint_i}$ by Algorithm 1 is based on the number of partitions $n$ and on the width of the critical voltage region, $V_{min} - V_{crash}$, which depends solely on the type of FPGA technology. However, the appropriate $V_{ccint_i}$ of the $i$-th FPGA partition should also depend on the slack values of that partition. The offline strategy only calculates a rough estimate of $V_{ccint_i}$, whereas the online strategy calibrates $V_{ccint_i}$ according to the runtime timing failures of the systolic array. In the online scheme we use one of the most popular online timing error detection schemes, Razor, which uses a double-sampling flip-flop to detect timing violations of pipeline stages. A Razor flip-flop is connected to every MAC of the systolic array to indicate its timing failure. Each MAC has a timing failure flag, which is controlled by the Razor flip-flop. If the timing failure flag of any MAC placed in the $i$-th FPGA partition is high, the $V_{ccint_i}$ of that partition is increased by one step. If the timing failure flags of all MACs placed in the $i$-th FPGA partition are low, the $V_{ccint_i}$ of that partition is decreased by one step. If a trial run precedes the actual run of the proposed systolic array, the $V_{ccint_i}$ of all partitions are tuned by this online process. The voltage boosting circuit can be implemented externally following the technique proposed in [11].

In Fig. 9, the clustering algorithm partitions the FPGA into 4 islands. The offline scheme stated in Sec. IV-A calculates one $V_{ccint_i}$ for each of FPGA partition-1, partition-2, partition-3 and partition-4. The power distribution unit delivers these four voltages to partition-1, partition-2, partition-3 and partition-4 respectively. Thereafter, the TPU circuit can be switched on and the online scheme becomes functional. In Fig. 9, each of the 4 FPGA partitions (partition-1, partition-2, partition-3 and partition-4) has its own timing failure flag bus from its Razor flip-flops (the timing_fail-part signals, one per partition) to detect the timing failures of that partition. The width of the timing failure flag bus of a specific partition is the number of MACs in that partition. If any timing failure flag from an FPGA partition becomes high, the $V_{ccint}$ of that partition is boosted by the power distribution network.

Fig. 9: Example: Partitioned FPGA, n=4

V. IMPLEMENTATION AND RESULT
As mentioned in Sec. II, the proposed architecture has 2 environments. The clustering algorithms are implemented in Python using the Scikit-learn library. The synthesis, implementation and bit file generation are done by Vivado using the board support package of the Artix-7 FPGA.
A. Implementational Challenges
The proposed design could not be implemented in full, as none of the present-day FPGA devices support variable voltage scaling in different logic partitions. The implementation issues of a power distribution unit with multiple $V_{ccint}$ in different partitions are beyond the scope of our paper. However, we consider that the implementation of voltage scaling technology in an ASIC [4] establishes the feasibility of implementing voltage scaling technology in an FPGA.

B. Our Validation Strategy
To validate the claim of the proposal, we have partially implemented the proposed scheme. We have designed a 16 × 16 systolic array, where 16 × 16 = 256 MACs are placed in the FPGA. As an example, one of the clustering algorithms mentioned in Sec. III divides the 16 × 16 systolic array into 4 partitions: partition-1, partition-2, partition-3 and partition-4. As the current Vivado tool does not allow simulating the design in the critical voltage region, our 16 × 16 systolic array is tested in the guardband region. Due to the unavailability of multiple $V_{ccint}$ support in a single FPGA die, our design has been implemented one partition at a time. Therefore, the power measurement of the 4 partitions is also done separately, with each partition considered as an individual circuit.

C. Results

The guardband region for the Artix-7 FPGA is 0.95 volt to 1.00 volt. For this example $n = 4$; the upper bound of Algorithm 1 is set to $V_{nom} = 1.00$ volt (in place of $V_{min}$) and the lower bound to $V_{min} = 0.95$ volt (in place of $V_{crash}$), therefore $V_b = 0.0125$ volt. Algorithm 1 calculates the $V_{ccint_i}$ of the 4 FPGA partitions of this design as 0.95625 for partition-1, 0.96875 for partition-2, 0.98125 for partition-3 and 0.99375 for partition-4. It has been observed that when the partial sums are moved to the bottom rows of the systolic array, the timing error increases significantly. Therefore, in this example the MACs of the bottom rows may have less slack and should be placed in partition-3 and partition-4, where $V_{ccint_i}$ is higher than in the other partitions. The MACs of the upper rows should have more slack and should be placed in partition-1 and partition-2, where $V_{ccint_i}$ is lower than in the other partitions. As shown in Fig. 9, say the clustering algorithm divides the 16 × 16 systolic array into four 8 × 8 systolic array partitions, so that each partition has 8 × 8 MACs. We assume that the top-left partition-1 consists of an 8 × 8 systolic array with $V_{ccint} = 0.95625 \approx 0.96$. Similarly, the top-right partition-2 has $V_{ccint} = 0.96875 \approx 0.97$, the bottom-left partition-3 has $V_{ccint} = 0.98125 \approx 0.98$ and the bottom-right partition-4 has $V_{ccint} = 0.99375 \approx 0.99$. Table II shows the dynamic power consumption for the $V_{ccint_i}$ of the 4 partitions, which together consume 382 mW, whereas the systolic array without voltage scaling consumes 408 mW. Therefore, adopting the voltage scaling scheme on this 16 × 16 systolic array reduces the dynamic power consumption for $V_{ccint_i}$ by 6.37%.

VI. CONCLUSION
The FPGA is becoming popular in DNN-based configurable clouds because of its efficiency and manageability, but excessive power consumption is a growing concern for present FPGA technology. A lot of effort has been made to reduce the power consumption by using multiple biasing voltages in FPGAs. This paper proposes a TPU architecture in which the MACs are placed in different partitions of the FPGA based on the slacks of the different paths of the MACs. Each partition of the FPGA uses a different biasing voltage $V_{ccint}$. The proposed online and offline schemes tune an appropriate $V_{ccint}$ for the group of slacks of the MACs placed in each partition. The experimental results show that the voltage-scaled systolic array can reduce power consumption. In future we will address two points: (i) improving the $V_{ccint}$ calibration by grouping input sequences with similar delay characteristics to predict future timing failures, and (ii) studying the trade-off between the DNN accuracy, estimated in terms of timing failures, and the number of partitions, and that between the number of partitions and dynamic power. The same partition-based voltage scaling can be used for other high-performance hardware accelerators to reduce their power consumption.

TABLE II: Comparison Results

| Scheme | Dimension of Systolic Array | Partition No. | $V_{ccint_i}$ (volt) | Logic Power (mW) | Signal Power (mW) | Clock Power (mW) | Dynamic Power for $V_{ccint}$ (mW) |
| Without Voltage Scaling | 16 × 16 | NA | 1.00 | 227 | 65 | 115 | 408 |
| Voltage Scaled | 8 × 8 | partition-1 | 0.96 | 50 | 15 | 110 (all four partitions) | 382 (all four partitions) |
| Voltage Scaled | 8 × 8 | partition-2 | 0.97 | 51 | 16 | | |
| Voltage Scaled | 8 × 8 | partition-3 | 0.98 | 52 | 16 | | |
| Voltage Scaled | 8 × 8 | partition-4 | 0.99 | 53 | 16 | | |
| % of Reduction | | | | | | | 6.37 |

REFERENCES

[1] A. M. Caulfield et al. A cloud-scale acceleration architecture. In , pages 1–13, 2016.
[2] A. Putnam et al. A reconfigurable fabric for accelerating large-scale datacenter services. IEEE Micro, 35(3):10–22, 2015.
[3] B. Salami, E. B. Onural, I. E. Yuksel, F. Koc, O. Ergin, A. Cristal Kestelman, O. Unsal, H. Sarbazi-Azad, and O. Mutlu. An experimental study of reduced-voltage operation in modern FPGAs for neural network acceleration. In , pages 138–149, 2020.
[4] P. Pandey, P. Basu, K. Chakraborty, and S. Roy. GreenTPU: Improving timing error resilience of a near-threshold tensor processing unit. In , pages 1–6, 2019.
[5] D. Ernst, Nam Sung Kim, S. Das, S. Pant, R. Rao, Toan Pham, C. Ziesler, D. Blaauw, T. Austin, K. Flautner, and T. Mudge. Razor: a low-power pipeline based on circuit-level timing speculation. In Proceedings. 36th Annual IEEE/ACM International Symposium on Microarchitecture, 2003. MICRO-36., pages 7–18, 2003.
[6] Stanford University. Hierarchical agglomerative clustering. , 2008.
[7] David Arthur and Sergei Vassilvitskii. K-means++: the advantages of careful seeding. In Proceedings of the 18th Annual ACM-SIAM Symposium on Discrete Algorithms, 2007.
[8] D. Comaniciu and P. Meer. Mean shift: a robust approach toward feature space analysis. IEEE Transactions on Pattern Analysis and Machine Intelligence, 24(5):603–619, 2002.
[9] Martin Ester, Hans-Peter Kriegel, Jörg Sander, and Xiaowei Xu. A density-based algorithm for discovering clusters in large spatial databases with noise. In Proceedings of the Second International Conference on Knowledge Discovery and Data Mining, KDD'96, pages 226–231. AAAI Press, 1996.
[10] R. Mukherjee and Seda Ogrenci Memik. Realizing low power FPGAs: A design partitioning algorithm for voltage scaling and a comparative evaluation of voltage scaling techniques for FPGAs. 2005.
[11] T. N. Miller, X. Pan, R. Thomas, N. Sedaghati, and R. Teodorescu. Booster: Reactive core acceleration for mitigating the effects of process variation and application imbalance in low-voltage chips. In