Voltage Scaling for Partitioned Systolic Array in a Reconfigurable Platform

Abstract

The rapid emergence of Field Programmable Gate Arrays (FPGAs) has accelerated research on hardware implementations of Deep Neural Networks (DNNs). Among DNN processors, domain-specific architectures such as Google's Tensor Processing Unit (TPU) have outperformed conventional GPUs. However, implementations of TPUs in reconfigurable hardware should emphasize energy savings to meet green computing requirements. Voltage scaling, a popular approach to energy savings, is delicate in FPGAs because it may cause timing failures if not done appropriately. In this work, we present an ultra-low-power FPGA implementation of a TPU for edge applications. We divide the systolic array of a TPU into different FPGA partitions, where each partition uses a different near-threshold computing (NTC) biasing voltage to run its FPGA cores. The biasing voltage for each partition is roughly calculated by the proposed offline scheme and then further calibrated by the proposed online scheme. Four clustering algorithms based on the slack values of different design paths are studied for partitioning the FPGA. To overcome the timing failures caused by NTC, higher-slack paths are placed in lower-voltage partitions and lower-slack paths are placed in higher-voltage partitions. The proposed architecture is simulated on an Artix-7 FPGA using the Vivado design suite and Python tools. The simulation results substantiate the implementation of a voltage-scaled TPU in FPGAs and also justify its power efficiency.


Rourab Paul, Sreetama Sarkar, Suman Sau, Koushik Chakraborty, Sanghamitra Roy and Amlan Chakrabarti

Computer Science & Engineering, Siksha 'O' Anusandhan, Odisha, India
Electrical and Computer Engineering, Technical University Munich, Germany
Computer Science & Information Technology, Siksha 'O' Anusandhan, Odisha, India
Dept. Electrical and Computer Engineering, Utah State University, Logan, USA
School of IT, University of Calcutta, Kolkata, India

Index Terms: FPGA partition, Low Power, TPU, Voltage Scaling

I. INTRODUCTION

The configurable logic blocks (CLBs) and switch matrices of FPGAs are power hungry, which makes FPGAs energy inefficient compared to ASICs. Recently, many researchers [1], [2] have reported CPU-FPGA based hybrid data center architectures that provide hardware acceleration for DNNs. Despite their power inefficiency, FPGAs have become popular in cloud-scale acceleration architectures due to their specialized hardware and the economic benefits of homogeneity. Therefore, reducing FPGA power for DNN applications is a very relevant research topic. The work in [3] studied timing failure versus biasing voltage for DNN implementations in FPGAs. Its authors underscaled the biasing voltage $V_{ccint}$ of the entire FPGA to increase the power efficiency of a Convolutional Neural Network (CNN) accelerator by a factor of 3. A single $V_{ccint}$ for the entire FPGA might not be the most power-efficient solution. Partitioning an FPGA according to slack and feeding different biasing voltages to different partitions can further reduce power for CNN implementations. In [4], the authors implemented a systolic array using a near-threshold computing (NTC) biasing voltage in an ASIC, which can predict the timing failure of the multiplier-accumulators (MACs) placed inside the systolic array of a TPU. The prediction of timing failure is based on the Razor flip-flop [5]. Higher fluctuation of input bits increases the possibility of timing failure under NTC conditions. In [4], once the timing failure of a MAC is predicted by its internal Razor flip-flop, the biasing voltage of that MAC is boosted.

Targeting FPGA-based DNN applications [1], our work investigates voltage scaling techniques for a TPU on the FPGA platform. A different $V_{ccint}$ for each MAC in a systolic array would be impractical for an FPGA; therefore, this work partitions the FPGA floor according to the slack values of the paths of the MACs. Each partition consists of a group of paths within the MACs having similar slacks, and each partition is connected to a different $V_{ccint}$. The proposed methodology extracts the synthesis timing report from the Vivado tool. In a synthesized design, the Vivado IDE timing engine estimates the net delays of paths based on connectivity and fanout. The clustering algorithms create clusters (groups) based on path delays. Clusters with higher delays, and hence lower slack, are placed in FPGA partitions with higher $V_{ccint}$, and clusters with lower delays, and hence higher slack, are placed in FPGA partitions with lower $V_{ccint}$. Here $V_{ccint}$ provides power to an FPGA core. The tuning of $V_{ccint}$ with slack is done by a unique offline-online strategy. The circuit-level challenges of implementing voltage scaling on the FPGA platform are beyond the present scope of this article. However, the feasibility of implementing the necessary hardware for voltage scaling support is evident from the successful implementations in ASIC technologies. As this support is unavailable in current FPGAs, we have simulated the design to validate the claim. The contributions of the paper are as follows:

• This paper proposes a new CAD flow to create a voltage-scaled TPU in FPGA-based platforms, considering the trade-off of circuit delay against biasing voltage.
• The proposed algorithm divides the systolic array of the TPU into different partitions. Each partition has a different $V_{ccint}$. The $V_{ccint}$ of each partition is scaled according to the delays (slacks) of the paths placed in it.
• The calibration of the $V_{ccint}$ of the different partitions is done by the proposed online and offline schemes.

The organization of the article is as follows: Sec. II outlines our proposed EDA tool flow. Sec. III discusses the clustering algorithms. The methodology of the proposed work is described in Sec. IV. Implementation and results are presented in Sec. V, and the conclusion in Sec. VI.

II. TOOL FLOW

A typical Xilinx FPGA design in the Vivado environment has three conventional steps: synthesis, implementation and bit file generation. The adopted tool flow of the proposed partitioned FPGA is divided into two environments: (i) the Vivado environment for synthesis, implementation and bit file generation, and (ii) the Python environment for clustering similar slacks. The entire tool flow is shown in Fig. 1.

A. Vivado Environment

The Vivado environment involves the three sub-steps stated below:

1) Synthesis:

The Vivado synthesis process transforms the register transfer level (RTL) description into a gate-level representation. The synthesis process generates the delays of all possible paths of the design. The timing report of the synthesis process contains 12 fields: name of the path, slack value, levels, high fanout, path from, path to, total delay of the path, logic delay, net delay, time requirement, source clock and destination clock, as shown in Table I. Note that the slack of each logic block is only estimated at a high level at this stage; the actual timing behavior of the design depends on the net delays after placement and routing.
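As a rough illustration of how the Python environment could consume this report, the sketch below reads slack values from a CSV export. The CSV layout, the column names and the helper `load_slacks` are assumptions made for this example (they mirror a subset of the Table I columns); the paper does not specify the exact report format it parses.

```python
import csv
import io

# Hypothetical CSV export of the synthesis timing report. The column names
# mirror a subset of Table I, and the four rows are copied from it.
REPORT = """Name,Slack,Levels,HighFanout,TotalDelay,LogicDelay,NetDelay,Requirement
Path 1,5.34,8,8,4.37,2.80,1.57,10.00
Path 2,5.49,8,8,4.40,2.83,1.57,10.00
Path 3,5.52,9,8,4.36,2.89,1.47,10.00
Path 4,5.59,9,8,4.30,2.83,1.47,10.00
"""


def load_slacks(text):
    """Return a {path name: slack} mapping parsed from CSV report text."""
    reader = csv.DictReader(io.StringIO(text))
    return {row["Name"]: float(row["Slack"]) for row in reader}


slacks = load_slacks(REPORT)
# The path with the smallest slack is the most timing-critical one.
worst = min(slacks, key=slacks.get)
```

Only the path name and slack are kept here because those are the values the clustering step operates on.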

2) Implementation:

The Vivado implementation process is a timing-driven flow that transforms a logical netlist and constraints (in Xilinx Design Constraints format) into a placed and routed design, ready for the bitstream generation process. In our proposed tool flow, the logical netlist is provided by the Vivado synthesis process, but the Xilinx Design Constraints (XDC) file is generated by a Python script. The clustered slack values are used to place the logic paths in specific locations on the FPGA floor.
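The sketch below shows one way such a script could emit placement constraints for a cluster, using Vivado's pblock commands (`create_pblock`, `resize_pblock`, `add_cells_to_pblock`). The pblock name, cell names and slice range are hypothetical placeholders, and the paper does not state which XDC commands its script actually writes.

```python
def pblock_constraints(name, cells, site_range):
    """Return XDC lines that confine `cells` to one rectangular pblock.

    `name`, `cells` and `site_range` are illustrative inputs; a real flow
    would derive them from the clustering result and the device floorplan.
    """
    cell_list = " ".join(cells)
    return [
        f"create_pblock {name}",
        f"resize_pblock [get_pblocks {name}] -add {{{site_range}}}",
        f"add_cells_to_pblock [get_pblocks {name}] [get_cells {{{cell_list}}}]",
    ]


# Hypothetical example: two MAC instances pinned to one region of the fabric.
lines = pblock_constraints(
    "pblock_part1", ["mac_0_0", "mac_0_1"], "SLICE_X0Y0:SLICE_X19Y49"
)
xdc_text = "\n".join(lines)
```

One pblock per cluster keeps all paths of a cluster inside the coordinate range of the partition that shares their biasing voltage.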

3) Bit File Generation:

Once placement and routing are completed by the implementation process, the flow generates the bitstream of the systolic array. The Xilinx bitstream generation program produces a bitstream for the Xilinx device configuration. If a processor is required, the design may include software application data.

B. Python Environment

The contribution of the paper lies in augmenting the standard FPGA design tool flow with a Python-based environment, which consists of a script that runs three successive processes: Choice of Clustering Algorithm, Cluster Generation and Constraint Generation.

Fig. 1: Tool Flow (Vivado environment: Synthesis Report, Implementation, Bit File Generation; Python environment: Cluster Algorithms, Clusters Report, Generate Constraint File)

Fig. 2: TPU Architecture

1) Choice of Clustering Algorithms:

A clustering algorithm suited to the requirements is chosen at this step. As stated in Sec. III, this paper investigates four commonly used clustering algorithms: Hierarchical, K-Means, Mean-Shift and DBSCAN.

2) Cluster Generation:

We assume that the FPGA is divided into a few partitions and that each partition has a different biasing voltage $V_{ccint}$. The clustering algorithms create a few groups: paths having similar slacks form a group and are placed in the same FPGA partition.

3) Constraint Generation:

Xilinx uses a constraint file format (XDC) to specify the coordinates of the different paths of the proposed systolic array. The XDC file is generated by the Python script.

III. CLUSTERING ALGORITHMS

We have investigated four clustering algorithms to group paths having similar slacks. An algorithm can be chosen based on the design requirements, for example whether we want to set a predefined number of clusters or set hyperparameters that determine the number of clusters automatically. Different algorithms work well for different data distributions. Depending on our design requirements, we choose among the following four algorithms:

A. Hierarchical

The hierarchical clustering algorithm [6] initially treats each data point as a single cluster and measures the distance between two clusters using a chosen distance measure (in this case, Euclidean distance). The two clusters that are closest to each other are merged. The process continues until all clusters have been merged into a single cluster (the root of the dendrogram). As shown in Fig. 3, the dendrogram is a tree-like structure used for visualizing the hierarchy of clusters. The number of clusters can be decided from the dendrogram. The hierarchical algorithm is computationally expensive for large datasets, having a time complexity of $O(n^3)$ in its naive agglomerative form, where $n$ is the number of data points. As is evident from the dendrogram, the branch joining the last two clusters is the longest, indicating that they are the most dissimilar, followed by the third and fourth clusters. The result of classifying the slack values into 2, 3 and 4 clusters is illustrated in Fig. 4. Different clusters in Figs. 4, 5, 6 and 7 are indicated using different colours.

TABLE I: A Fragment of Timing Report from Synthesis

| Name | Slack | Levels | High Fanout | From | To | Total Delay | Logic Delay | Net Delay | Requirement | Source Clock | Destination Clock |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Path 1 | 5.34 | 8 | 8 | GEN_REG_I[0].GEN_REG_J[1].uut/prev_activ_reg[1]/C | GEN_REG_I[1].GEN_REG_J[1].uut/sig_mac_out_reg[16]/D | 4.37 | 2.80 | 1.57 | 10.00 | clk | clk |
| Path 2 | 5.49 | 8 | 8 | GEN_REG_I[0].GEN_REG_J[1].uut/prev_activ_reg[1]/C | GEN_REG_I[1].GEN_REG_J[1].uut/sig_mac_out_reg[15]/D | 4.40 | 2.83 | 1.57 | 10.00 | clk | clk |
| Path 3 | 5.52 | 9 | 8 | GEN_REG_I[0].GEN_REG_J[1].uut/prev_activ_reg[1]/C | GEN_REG_I[1].GEN_REG_J[1].uut/sig_mac_out_reg[14]/D | 4.36 | 2.89 | 1.47 | 10.00 | clk | clk |
| Path 4 | 5.59 | 9 | 8 | GEN_REG_I[0].GEN_REG_J[1].uut/prev_activ_reg[1]/C | GEN_REG_I[1].GEN_REG_J[1].uut/sig_mac_out_reg[13]/D | 4.30 | 2.83 | 1.47 | 10.00 | clk | clk |

Fig. 3: Dendrogram

B. K-Means Clustering

K-Means clustering groups data into a predefined number of clusters ($k$). At the beginning, $k$ cluster centers are randomly initialized [7]. The algorithm computes the distance between each data point and the cluster centers and assigns each data point to the cluster whose center is closest to it. The cluster centers are then recomputed as the mean of the data points belonging to each cluster. The process is repeated for a predefined number of steps or until the cluster centers no longer change significantly. K-Means clustering is simple and fast, with a time complexity of $O(n)$. Fig. 5 illustrates the results of applying the K-Means clustering algorithm to the slack values of the systolic array for 3, 4 and 5 clusters.
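A minimal Scikit-learn sketch of this step on one-dimensional slack data is given below. The first four slack values are taken from Table I; the remaining values are invented purely so that three groups exist, and the choice of `n_clusters=3`, `n_init=10` and `random_state=0` is an assumption for reproducibility, not a setting reported in the paper.

```python
import numpy as np
from sklearn.cluster import KMeans

# First four values come from Table I; the rest are made-up illustration data.
slacks = np.array([5.34, 5.49, 5.52, 5.59, 6.80, 6.85, 6.91, 8.10, 8.15, 8.22])

# scikit-learn expects a 2-D array of shape (n_samples, n_features).
X = slacks.reshape(-1, 1)

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
labels = km.labels_                              # cluster index per path
centers = np.sort(km.cluster_centers_.ravel())   # mean slack of each cluster
```

Each resulting label would then correspond to one FPGA partition, with the cluster of lowest mean slack mapped to the highest biasing voltage.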

C. Mean-Shift Clustering

Mean-Shift clustering [8] is based on the idea of Kernel Density Estimation (KDE). KDE assumes that the data points are generated from an underlying distribution and tries to estimate that distribution by assigning a kernel to each data point. The most commonly used kernel is the Gaussian or RBF kernel. The mean-shift algorithm is designed so that the points iteratively climb the KDE surface and are shifted to the nearest KDE peaks. It starts with a randomly selected point as the center of the kernel. Thereafter, it moves the kernel towards regions of higher density by shifting the center of the kernel to the mean of the points within the window (hence the name mean-shift). This continues until shifting the kernel no longer includes more points. The algorithm does not need the number of clusters to be specified beforehand, but it is computationally expensive compared to K-Means (time complexity of $O(n \log n)$ in lower dimensions for the sklearn implementation). The selection of the window size (radius $r$) can be non-trivial and plays a key role in the success of the algorithm. Setting the radius to 0.4 for the slack values of the systolic array yields 4 clusters, as observed in Fig. 6.
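The following sketch applies Scikit-learn's `MeanShift` with the radius of 0.4 mentioned above. The slack data are illustrative (only the first four values come from Table I), so the number of clusters found here is a property of this toy data and not the paper's result. Note also that Scikit-learn's implementation uses a flat kernel rather than a Gaussian one.

```python
import numpy as np
from sklearn.cluster import MeanShift

# First four values come from Table I; the rest are made-up illustration data.
slacks = np.array([5.34, 5.49, 5.52, 5.59, 6.80, 6.85, 6.91, 8.10, 8.15, 8.22])
X = slacks.reshape(-1, 1)

# bandwidth plays the role of the window radius r discussed in the text.
ms = MeanShift(bandwidth=0.4).fit(X)

labels = ms.labels_
n_clusters = len(ms.cluster_centers_)  # found automatically, not preset
```

Unlike K-Means, the number of clusters is an output here, which is why the choice of the radius matters so much.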

D. DBSCAN

The DBSCAN algorithm [9] has two important hyperparameters, based on which it determines the number of clusters: epsilon, the maximum distance between two samples for one to be considered in the neighborhood of the other, and minpoints, the number of samples in a neighborhood for a point to be considered a core point. At each step, a data point that has not been visited before is taken. If there are more data points than minpoints within its epsilon radius, all of these data points are marked as belonging to a cluster; otherwise the first point is marked as noise. For all points in the newly formed cluster, the points within their epsilon neighborhood are checked and labeled as either belonging to the cluster or noise. The process continues until all data points have been labeled. The greatest advantage of DBSCAN is that it can identify outliers as noise, unlike other algorithms, which place every point into a cluster even if a data point is significantly different from the rest. The time complexity of this algorithm is $O(n)$ for a reasonable epsilon. The algorithm is not effective for clusters with varying density, since suitable values of epsilon and minpoints differ from cluster to cluster.

IV. OFFLINE-ONLINE SCHEMES

To mitigate timing failures in the critical voltage region, we adopt two sequential schemes: (i) an offline scheme, which performs FPGA partitioning and a rough estimation of $V_{ccint_i}$ depending on the FPGA technology, and (ii) an online scheme, which calibrates a suitable $V_{ccint_i}$ for each FPGA partition using Razor flip-flops. Each FPGA partition consists of a group of MACs, and all the groups of MACs together form the systolic array of the TPU. Apart from the systolic array, the TPU has memory to store activation and weight inputs, a PCI interface, control circuitry, etc. The architecture of the TPU is shown in Fig. 2.
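The online calibration rule detailed in Sec. IV-B (raise the voltage of a partition by one step if any of its timing failure flags is high, lower it by one step otherwise) can be sketched as below. The function name, the step size and the clamping to a voltage range are assumptions added for this illustration; the paper does not give a step size or state how the range is enforced.

```python
def calibrate_step(vccint, flags, step, v_low, v_high):
    """Apply one online update to the biasing voltage of a single partition.

    flags  : per-MAC timing failure flags of this partition (True = failure)
    step   : voltage increment in volts (hypothetical value chosen by caller)
    v_low, v_high : assumed bounds of the allowed operating range
    """
    if any(flags):
        vccint += step   # at least one Razor flag is high: boost the voltage
    else:
        vccint -= step   # no timing failure observed: try a lower voltage
    # Keep the result inside the allowed range (an assumption of this sketch).
    return min(max(vccint, v_low), v_high)


# One failing MAC in the partition raises the voltage by one step.
raised = calibrate_step(0.96, [False, True, False], 0.01, 0.95, 1.00)
# No failing MAC lowers it by one step.
lowered = calibrate_step(0.96, [False, False, False], 0.01, 0.95, 1.00)
```

Repeating this update during a trial run lets each partition settle near the lowest voltage at which its MACs stop raising timing failure flags.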

A. Offline

The proposed offline scheme works in the Vivado and Python environments. As shown in Fig. 1, synthesis is the first step of the proposed tool flow; it takes a netlist of complex logic blocks (CLBs) of the systolic array generated by the Vivado tool. The synthesis report is generated after the technology mapping and packing stages and contains the timing slacks of all possible paths of the systolic array. The proposed approach considers only nodes along paths because (i) the nodes along a path have data dependencies and should be placed in the same FPGA partition even without considering voltage scaling [10], and (ii) the slack values of the nodes along a path are usually close to each other.

Fig. 4: Hierarchical Cluster of Slack of Systolic Array, panels (a), (b), (c)

Fig. 5: K-Means Cluster of Slack of Systolic Array, panels (a), (b), (c)

Fig. 7: DBSCAN Clustering of Slack of Systolic Array

The second step of the proposed methodology involves the choice of the clustering algorithm and cluster generation. As stated in Sec. III, the four clustering algorithms (Hierarchical, K-Means, Mean-Shift and DBSCAN) create multiple clusters from the paths available in the synthesis report. The primary concern is to identify clusters of nodes along a path, which can share the slack available across that path. Even for the same number of clusters, different algorithms classify the data points slightly differently. Unlike K-Means, the Hierarchical, Mean-Shift and DBSCAN algorithms do not need the number of clusters to be specified beforehand. DBSCAN is found to perform best in this case, as it groups nearby data points together, has a reasonable time complexity and can also identify outliers. Hence, the clustered paths returned by DBSCAN are chosen for subsequent simulations.

Once the number of clusters is fixed, we need to decide the voltage values of the different FPGA partitions. In Fig. 8, we illustrate three voltage regions in an FPGA, as also supported by the work in [3]. A voltage below the FPGA crashing voltage $V_{crash}$ causes timing failures, which reduce the DNN accuracy to nearly zero. The region between the minimum voltage $V_{min}$ and the nominal voltage $V_{nom}$ is called the guardband region, where the DNN accuracy is 100% but the power efficiency is lowest. In the critical region, between $V_{crash}$ and $V_{min}$, the closer the voltage is to $V_{crash}$, the higher the power efficiency and the lower the DNN accuracy. Similarly, as $V_{ccint}$ approaches $V_{min}$ in the critical region, the power efficiency decreases and the DNN accuracy increases. In our proposed architecture, we assume that the operating voltage range for the systolic array is $V_{crash}$ to $V_{min}$. If the chosen clustering algorithm produces $n$ clusters, we need $n$ partitions in the FPGA. The primary $V_{ccint}$ estimate for each FPGA partition is computed by Algorithm 1.

Fig. 8: Voltage behaviour for $V_{ccint}$ (crash region, critical region and guardband region, delimited by $V_{crash}$, $V_{min}$ and $V_{nom}$; curves: accuracy of design and power efficiency)

In Xilinx FPGAs, the coordinates of the nets are specified by two parameters $(X_i, Y_j)$, and each FPGA partition covers a range of these coordinates. In the third step of the proposed methodology, each clustered path computed by the clustering algorithms is placed in a particular FPGA partition, which is restricted to specific $X_i$, $Y_j$ ranges. This restriction is written to the XDC file during the Generate Constraint File process.
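The rough offline estimate of Algorithm 1 can be sketched in Python as follows. The sketch assumes that the per-partition voltage is the midpoint of each of $n$ equal sub-bands of width $V_b$, starting from $V_{crash}$; this reading is an assumption, chosen because it reproduces the rounded values 0.96, 0.97, 0.98 and 0.99 V listed in Table II when the guardband limits of Sec. V (0.95 V and 1.00 V) are used with $n = 4$.

```python
def offline_vccint(v_min, v_crash, n):
    """Rough per-partition biasing voltages, lowest partition first.

    Assumed reading of Algorithm 1: split [v_crash, v_min] into n equal
    sub-bands of width v_b and take the midpoint of each sub-band.
    """
    v_b = (v_min - v_crash) / n   # width of one voltage sub-band
    v_l = v_crash                 # lower edge of the current sub-band (assumed start)
    voltages = []
    for _ in range(n):
        voltages.append((v_l + v_l + v_b) / 2)  # midpoint of the sub-band
        v_l += v_b                              # move to the next sub-band
    return voltages


# Guardband limits from Sec. V stand in for v_crash and v_min here.
levels = offline_vccint(v_min=1.00, v_crash=0.95, n=4)
rounded = [round(v, 2) for v in levels]
```

These values are only starting points; the online scheme then adjusts each one according to the observed timing failures.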

Algorithm 1 Voltage Scaling

Require: $V_{ccint}$, $V_{min}$, $V_{crash}$ and $n$
  $V_b = (V_{min} - V_{crash}) / n$
  $V_l = V_{crash}$
  for $i = 0$ to $n - 1$ do
    $V_{ccint_i} = (V_l + V_l + V_b) / 2$
    $V_l = V_l + V_b$
  end for

B. Online

The $V_{ccint_i}$ of the $i$-th FPGA partition calculated by Algorithm 1 is applied to the $V_{ccint_i}$ pin of the $i$-th FPGA partition. The calculation of $V_{ccint_i}$ by Algorithm 1 is based on the number of partitions $n$ and the critical voltage region $V_{min} - V_{crash}$, which depends solely on the type of FPGA technology. However, the appropriate $V_{ccint_i}$ of the $i$-th FPGA partition should also depend on the slack values of that partition. The offline strategy only calculates a rough estimate of $V_{ccint_i}$, whereas the online strategy calibrates $V_{ccint_i}$ according to the runtime timing failures of the systolic array. In the online scheme we use one of the most popular online timing error detection schemes, Razor, which uses a double-sampling flip-flop to detect timing violations of pipeline stages. A Razor flip-flop is connected to every MAC of the systolic array to indicate its timing failure. Each MAC has a timing failure flag, which is controlled by the Razor flip-flop. If the timing failure flag of any MAC placed in the $i$-th FPGA partition is high, the $V_{ccint_i}$ of that partition is increased by one step. If the timing failure flags of all MACs placed in the $i$-th FPGA partition are low, the $V_{ccint_i}$ of that partition is decreased by one step. If a trial run is performed before the actual run of the proposed systolic array, the $V_{ccint_i}$ of all partitions will be tuned accurately by this online process. The voltage boosting circuit can be implemented externally following the technique proposed in [11].

In Fig. 9, the clustering algorithm partitions the FPGA into 4 islands. The offline scheme described in Sec. IV-A calculates four values, $V_{ccint_1}$, $V_{ccint_2}$, $V_{ccint_3}$ and $V_{ccint_4}$, for FPGA partition-1, partition-2, partition-3 and partition-4, respectively. The power distribution unit distributes these voltages to the respective partitions. Thereafter, the TPU circuit can be switched on and the online scheme becomes functional. In Fig. 9, the four FPGA partitions have four flags from the Razor flip-flops, timing_fail-part1, timing_fail-part2, timing_fail-part3 and timing_fail-part4, respectively, to detect timing failures in the corresponding partition of the FPGA. The width of the timing failure flag from a specific partition is the number of MACs in that partition. If any timing failure flag from any FPGA partition becomes high, the $V_{ccint}$ of that partition is boosted by the power distribution network.

Fig. 9: Example: Partitioned FPGA, n = 4

V. IMPLEMENTATION AND RESULT

As mentioned in Sec. II, the proposed architecture has two environments. The clustering algorithms are implemented in Python using the Scikit-learn library. Synthesis, implementation and bit file generation are done in Vivado using the board support package of the Artix-7 FPGA.
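Since the clustering step is implemented with Scikit-learn, the DBSCAN grouping chosen in Sec. IV-A might look roughly like the sketch below. The values `eps=0.2` and `min_samples=3` and most of the slack data are assumptions for illustration (only the first four slacks come from Table I); the paper does not report its DBSCAN hyperparameters.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# First four values come from Table I; the rest, including the isolated
# value 9.90, are made-up illustration data.
slacks = np.array(
    [5.34, 5.49, 5.52, 5.59, 6.80, 6.85, 6.91, 8.10, 8.15, 8.22, 9.90]
)
X = slacks.reshape(-1, 1)

# eps: neighborhood radius; min_samples: neighbors (including the point
# itself) required for a core point. Both values are illustrative.
db = DBSCAN(eps=0.2, min_samples=3).fit(X)

labels = db.labels_                       # -1 marks points treated as noise
n_clusters = len(set(labels.tolist()) - {-1})
outliers = slacks[labels == -1]
```

In this toy data the isolated slack value is reported as noise instead of being forced into a partition, which is the property that motivated choosing DBSCAN.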

A. Implementational Challenges

The proposed design could not be fully implemented, as none of the present-day FPGA devices supports variable voltage scaling in different logic partitions. The implementation issues of a power distribution unit with multiple $V_{ccint}$ values in different partitions are beyond the scope of our paper. However, we consider that the implementation of voltage scaling in ASICs [4] establishes the feasibility of implementing voltage scaling in FPGAs.

B. Our Validation Strategy

To validate the claim of the proposal, we have partially implemented the proposed scheme. We have designed a 16 × 16 systolic array, in which 16 × 16 = 256 MACs are placed in the FPGA. As an example, one of the clustering algorithms mentioned in Sec. III divides the 16 × 16 systolic array into 4 partitions: partition-1, partition-2, partition-3 and partition-4. As the current Vivado tool does not allow simulating the design in the critical voltage region, our 16 × 16 systolic array is tested in the guardband region. Due to the unavailability of multiple-$V_{ccint}$ support in a single FPGA die, our design has been implemented one partition at a time. Therefore, the power measurement of the 4 partitions is also done separately, with each partition considered as an individual circuit.

C. Results

The guardband region for the Artix-7 FPGA is 0.95 V to 1.00 V. Because the design is tested in the guardband region, for this example we take $n = 4$, $V_{min} = V_{nom} = 1.00$ V and $V_{crash} = V_{min} = 0.95$ V; therefore $V_b = 0.0125$ V. Algorithm 1 calculates the $V_{ccint_i}$ of the 4 FPGA partitions of this design, which are approximately 0.96 V for partition-1, 0.97 V for partition-2, 0.98 V for partition-3 and 0.99 V for partition-4. It has been observed that when the partial sums are moved to the bottom rows of the systolic array, the timing error increases significantly. Therefore, in this example the MACs of the bottom rows may have less slack and should be placed in partition-3 and partition-4, where $V_{ccint_i}$ is higher. The MACs of the upper rows should have more slack and should be placed in partition-1 and partition-2, where $V_{ccint_i}$ is lower. As shown in Fig. 9, say the clustering algorithm divides the 16 × 16 systolic array into four 8 × 8 systolic array partitions, each with 8 × 8 MACs. We assume that the top-left partition-1 consists of an 8 × 8 systolic array with $V_{ccint_1} \approx 0.96$ V. Similarly, the top-right partition-2 has $V_{ccint_2} \approx 0.97$ V, the bottom-left partition-3 has $V_{ccint_3} \approx 0.98$ V and the bottom-right partition-4 has $V_{ccint_4} \approx 0.99$ V. Table II shows the dynamic power consumption for the $V_{ccint_i}$ of the 4 partitions, which together consume 382 mW, whereas the systolic array without voltage scaling consumes 408 mW. Therefore, adopting the voltage scaling scheme on this 16 × 16 systolic array reduces the dynamic power consumption for $V_{ccint}$ by 6.37%.

VI. CONCLUSION

FPGAs are becoming popular in DNN-based configurable clouds because of their efficiency and manageability, but immoderate power consumption is a growing concern for present FPGA technology. A lot of effort has been made to reduce power consumption by using multiple biasing voltages in FPGAs. This paper proposes a TPU architecture in which the MACs are placed in different partitions of the FPGA based on the slacks of the different paths of the MACs. Each partition of the FPGA uses a different biasing voltage $V_{ccint}$. The proposed online and offline schemes tune an appropriate $V_{ccint}$ for the group of slacks of the MACs placed in each partition. The experimental results show that the voltage-scaled systolic array can reduce power consumption. In future we will address two points: (i) improving the $V_{ccint}$ calibration by grouping input sequences with similar delay characteristics to predict future timing failures, and (ii) studying the trade-off between the DNN accuracy, estimated in terms of timing failures, and the number of partitions, and that between the number of partitions and dynamic power. The same partition-based voltage scaling can be used for other high-performance hardware accelerators to reduce power consumption.

TABLE II: Comparison Results

| Scheme | Dimension of Systolic Array | Partition No. | $V_{ccint_i}$ (V) | Logic Power (mW) | Signal Power (mW) | Clock Power (mW) | Dynamic Power for $V_{ccint}$ (mW) |
|---|---|---|---|---|---|---|---|
| Without Voltage Scaling | 16 × 16 | NA | 1.00 | 227 | 65 | 115 | 408 |
| Voltage Scaled | 8 × 8 | partition-1 | 0.96 | 50 | 15 | | |
| Voltage Scaled | 8 × 8 | partition-2 | 0.97 | 51 | 16 | | |
| Voltage Scaled | 8 × 8 | partition-3 | 0.98 | 52 | 16 | | |
| Voltage Scaled | 8 × 8 | partition-4 | 0.99 | 53 | 16 | | |
| Voltage Scaled (all four partitions) | | | | | | 110 | 382 |
| % of Reduction | | | | | | | 6.37 |

REFERENCES

[1] A. M. Caulfield et al. A cloud-scale acceleration architecture. In , pages 1–13, 2016.
[2] A. Putnam et al. A reconfigurable fabric for accelerating large-scale datacenter services. IEEE Micro, 35(3):10–22, 2015.
[3] B. Salami, E. B. Onural, I. E. Yuksel, F. Koc, O. Ergin, A. Cristal Kestelman, O. Unsal, H. Sarbazi-Azad, and O. Mutlu. An experimental study of reduced-voltage operation in modern FPGAs for neural network acceleration. In , pages 138–149, 2020.
[4] P. Pandey, P. Basu, K. Chakraborty, and S. Roy. GreenTPU: Improving timing error resilience of a near-threshold tensor processing unit. In , pages 1–6, 2019.
[5] D. Ernst, Nam Sung Kim, S. Das, S. Pant, R. Rao, Toan Pham, C. Ziesler, D. Blaauw, T. Austin, K. Flautner, and T. Mudge. Razor: a low-power pipeline based on circuit-level timing speculation. In Proceedings. 36th Annual IEEE/ACM International Symposium on Microarchitecture, 2003. MICRO-36., pages 7–18, 2003.
[6] Stanford University. Hierarchical agglomerative clustering. , 2008.
[7] David Arthur and Sergei Vassilvitskii. K-means++: the advantages of careful seeding. In Proceedings of the 18th Annual ACM-SIAM Symposium on Discrete Algorithms, 2007.
[8] D. Comaniciu and P. Meer. Mean shift: a robust approach toward feature space analysis. IEEE Transactions on Pattern Analysis and Machine Intelligence, 24(5):603–619, 2002.
[9] Martin Ester, Hans-Peter Kriegel, Jörg Sander, and Xiaowei Xu. A density-based algorithm for discovering clusters in large spatial databases with noise. In Proceedings of the Second International Conference on Knowledge Discovery and Data Mining, KDD'96, pages 226–231. AAAI Press, 1996.
[10] R. Mukherjee and Seda Ogrenci Memik. Realizing low power FPGAs: A design partitioning algorithm for voltage scaling and a comparative evaluation of voltage scaling techniques for FPGAs. 2005.
[11] T. N. Miller, X. Pan, R. Thomas, N. Sedaghati, and R. Teodorescu. Booster: Reactive core acceleration for mitigating the effects of process variation and application imbalance in low-voltage chips. In