AMOEBA: A Coarse Grained Reconfigurable Architecture for Dynamic GPU Scaling
Xianwei Cheng, Hui Zhao, Mahmut Kandemir, Beilei Jiang, Gayatri Mehta
Xianwei Cheng, Computer Science and Engineering Department, University of North Texas, [email protected]
Hui Zhao, Computer Science and Engineering Department, University of North Texas, [email protected]
Mahmut Kandemir, Computer Science and Engineering Department, Pennsylvania State University, [email protected]
Beilei Jiang, Computer Science and Engineering Department, University of North Texas, [email protected]
Gayatri Mehta, Electrical Engineering Department, University of North Texas, [email protected]
ABSTRACT
Different GPU applications exhibit varying scalability patterns with network-on-chip (NoC), coalescing, memory and control divergence, and L1 cache behavior. A GPU consists of several Streaming Multiprocessors (SMs) that collectively determine how shared resources are partitioned and accessed. Recent years have seen divergent paths in SM scaling towards scale-up (fewer, larger SMs) vs. scale-out (more, smaller SMs). However, neither scaling up nor scaling out can meet the scalability requirements of all applications running on a given GPU system, which inevitably results in performance degradation and resource under-utilization for some applications. In this work, we investigate the major design parameters that influence GPU scaling. We then propose AMOEBA, a solution to GPU scaling through reconfigurable SM cores. AMOEBA monitors and predicts application scalability at run-time and adjusts the SM configuration to meet program requirements. AMOEBA also enables dynamic creation of heterogeneous SMs through independent fusing or splitting. AMOEBA is a microarchitecture-based solution and requires no additional programming effort or custom compiler support. Our experimental evaluations with application programs from various benchmark suites indicate that AMOEBA is able to achieve a maximum performance gain of 4.3x, and generates an average performance improvement of 47% across all benchmarks tested.
GPUs have emerged as performance accelerators for general-purpose applications and take advantage of the single-instruction multiple-threads (SIMT) programming model to improve the performance of data-parallel computations. Supercomputers [1, 2], cloud servers [3], desktops [4] and even mobile devices [5] already benefit significantly from GPUs to achieve better performance and higher power efficiency. A GPU typically consists of many compute units (CUs), also called streaming multiprocessors (SMs), and each SM contains a large number of simple compute cores [6]. GPUs leverage the massive number of computing cores in SMs to exploit thread-level parallelism (TLP) in an attempt to hide memory access latency [7, 8].

The multiprocessor industry has been fueled by Moore's Law for many years, and processor performance has been improved through increasing transistor counts. However, Moore's Law is slowing down because we are reaching the technological limits of how small transistors can be made. Increasing the chip size can allow us to add more transistors to a chip. However, this is not a sustainable solution for several reasons: (1) there is not enough power budget to allow all transistors to be powered on simultaneously, because transistor threshold voltage does not scale with technology nodes and per-transistor switching energy stays almost constant [9]; (2) cost becomes prohibitive in manufacturing chips with ultra-high transistor counts [10]; and (3) data communication becomes a bottleneck as chip size increases [11].

TLP has been considered a promising solution to tackle the slowdown of Moore's Law, and GPU architecture is based on the idea of exploiting TLP. The computing power of a GPU arises from its SIMT architecture: many threads are executed concurrently in an SIMD fashion. However, an average programmer may not be aware of the details of the underlying hardware, and thus may not write high-quality code that fully utilizes the available GPU resources. As a result, many general-purpose applications are not fully optimized for running on a specific GPU architecture, and this causes under-utilization of hardware resources [12, 13, 14, 15, 16, 17, 18]. For example, it has been observed that cores are idle for 52% to 98% of the execution time for some GPU benchmarks [12]. Therefore, instead of adding more resources to GPUs, exploring optimized resource utilization techniques can be a more viable option to enhance GPU performance and power efficiency.

There have been earlier efforts targeting maximized GPU resource utilization. For example, several prior works have proposed to share a GPU among multiple applications and used software-level techniques to manage resource sharing [19, 20, 21]. On the hardware side, spatial multitasking has been proposed as a technique that partitions a GPU among multiple kernels at an SM granularity [22]. Several techniques have also been proposed to share resources among kernels inside each SM, such as simultaneous multikernel (SMK) [12], warp-slicer [23], and GPU Maestro [16]. However, these techniques need to run multi-programmed workloads to fully utilize GPU resources (i.e., finding more tasks to avoid GPU inactive cycles). Also, they do not consider an application's scalability and do not configure hardware to meet the software's resource demands.

An alternative approach is to dynamically reconfigure hardware to avoid resource under-utilization. Reconfigurable cores have been proposed in the past for CPUs to facilitate parallel execution [24, 9, 25, 26, 27].
However, the overhead of reconfiguring CPU cores is high due to the complexities associated with CPU architectures. There have been very few reconfigurable GPU architectures proposed. For example, R-GPU interconnects the compute cores inside an SM to reduce data movement and remove decoding overhead by assigning each core a fixed operation [14]. In comparison, SGMF is a dataflow architecture using a coarse-grain reconfigurable fabric [13]. However, SGMF needs help from the compiler to convert the kernels into dataflow graphs. Neither work considers the scalability of applications with respect to system bottlenecks, such as the interconnection network, memory access patterns, and control divergence. In addition, these prior works only explore intra-SM resource utilization and assume that the number and size of SMs are fixed. However, sharing resources among SMs is also important because applications have varying scalability patterns depending on SM settings, yet exploration of this design space has largely been ignored in prior work.

In this work, we present AMOEBA, a reconfigurable architecture to improve GPU resource utilization, performance, and energy efficiency. AMOEBA takes into account several important application resource requirements, such as interconnect throughput, memory access patterns, and control divergence, before selecting an optimal GPU scaling option. AMOEBA is a coarse-grained reconfigurable architecture that enables flexible SM scaling at a low cost. It also exploits heterogeneity of SMs through dynamic fusing and splitting in order to accommodate program divergence.

We make the following contributions in this paper:

• We investigate the GPU scaling problem under a resource bound. We identify the important factors determining whether an SM should be designed in a scale-up or scale-out fashion. Building upon the results from this investigation, we propose a coarse-grained reconfigurable architecture that fuses the baseline scale-out SMs into larger scale-up SMs. This design enables optimized resource utilization across SM boundaries.

• We design an online controller that takes into account an application's dynamic behavior and makes reconfiguration decisions accordingly. The controller employs a binary logistic regression model to predict application scalability at low cost.
• We provide the design details of the proposed reconfigurable architecture. Our proposed architecture enables coarse-grained SM fusion, and it can provide support for both scale-up and scale-out GPU configurations.

• We propose a scheme to split individual scale-up SM cores dynamically when program divergence causes pipeline stalls.

To the best of our knowledge, this is the first paper to propose a reconfigurable GPU architecture that can dynamically toggle between scale-up and scale-out options, with the goal of maximizing resource utilization.
The GPU execution model divides the total work space into a grid and assigns a work item, also called a thread, to work on each portion of the data. Each thread executes the same set of instructions, which enables parallel multi-threaded execution in an SIMT fashion. Each segment of code loaded to a GPU is called a kernel. A group of threads that execute a kernel concurrently is referred to as a workgroup, also called a thread-block or a cooperative thread-array (CTA). The total work space is divided into blocks or CTAs, and threads within a given CTA can communicate with each other.

Figure 1: GPU architecture overview.

A high-level view of a GPU architecture is shown in Figure 1. A GPU consists of multiple compute units (CUs) or streaming multiprocessors (SMs), which are analogous to CPU cores. Each SM contains fetch, decode, execution, and memory access logic, and these units collectively form a pipeline. Several compute cores reside within each SM, and each compute core is a large, heavily-pipelined execution unit capable of executing both integer and floating-point operations. When a kernel is launched, each CTA is dispatched to an SM and executes there until its completion. A CTA is further divided into units called warps, also known as wavefronts. Typically, there are a large number of warps in flight inside an SM so that memory access latency can be masked by concurrent execution. There is a unified L2 cache coupled with the memory controllers, while the global memory is off-chip. On-chip data communication is implemented through an on-chip network.
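To make the grid/CTA/warp hierarchy concrete, here is a minimal CUDA sketch of a kernel launch; the kernel name, data sizes, and CTA dimensions are our illustration, not values taken from the paper.

```cuda
// Each thread processes one element; threads are grouped into CTAs
// (thread-blocks), and the hardware further groups every 32 threads
// of a CTA into a warp that executes in lockstep (SIMT).
__global__ void scale(float *data, float factor, int n) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;  // global thread id
    if (tid < n)
        data[tid] *= factor;
}

int main() {
    const int n = 1 << 20;
    float *d_data;
    cudaMalloc(&d_data, n * sizeof(float));

    // 256 threads per CTA -> 8 warps per CTA; the runtime dispatches
    // each CTA to an SM, where it runs until completion.
    dim3 cta(256);
    dim3 grid((n + cta.x - 1) / cta.x);
    scale<<<grid, cta>>>(d_data, 2.0f, n);
    cudaDeviceSynchronize();

    cudaFree(d_data);
    return 0;
}
```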
Figure 2: SM scaling trends (number of SMs vs. number of cores per SM) for NVIDIA's GTX GPU family [28].

The primary execution unit in a GPU is the compute core, and cores are grouped into SMs that share resources such as the register file, local memory, and warp scheduler. Due to resource limitations, once the chip size is decided, the total number of compute cores is fixed. There then arises an important scaling problem: how should compute cores and other resources be partitioned into SMs? That is, should we opt for scale-up SMs (by including more compute cores and resources in a smaller number of SMs) or scale-out SMs (by having more SMs with fewer cores and less resource inside)? The scaling configuration of SMs is critical since it directly determines the maximum parallelism among GPU threads and impacts resource sharing and utilization.

Figure 2 shows the scaling trends of NVIDIA's GTX GPU family during the last 11 years. The number of computing cores per SM can be used to represent the scaling in SM size. We observe that both the size and the number of SMs increased from 2008 to 2011. However, after 2011, the trends in SM size and SM count start to part ways in opposite directions. This is because we are reaching the limit in terms of the total number of computing cores that can be integrated into a chip due to power and area constraints. Therefore, it is not possible to scale up both the size and the number of SMs; we either scale out or scale up, but not both, as shown in the figure. The most recent trend, since 2017, has been scaling out. However, the question is whether this trend of scaling out is sustainable for the future, and if not, what is the optimal configuration for the best performance and resource utilization?
As discussed above, warps execute in SMs, and all threads in one SM share GPU resources such as shared memory, L1 cache, register file, warp scheduler, and interconnect interface. SM scaling greatly influences resource utilization and power-performance efficiency. Due to their different characteristics and resource requirements, different applications exhibit varying SM scaling patterns. We start by investigating the scaling of multiple benchmarks; the results are plotted in Figure 3(a).

Figure 3: Performance with SM scaling (a) with a mesh NoC, (b) with a perfect NoC. (The x-axis is the number of SMs and the y-axis is IPC normalized to 16 SMs.)

Figure 4: Memory access coalescing results with different GPU scaling options. Actual memory access rate represents the memory accesses after coalescing. Here, we experiment with different SM scaling options with 16, 25, 36, and 64 SMs.

In this experiment, we fix the total amount of chip resources but vary the size and the number of SMs. As can be observed, some applications benefit from scaling out with smaller SMs (CP, SC), while other applications benefit from scale-up SMs (MUM, RAY). This result indicates that there is no single scaling setting that benefits all applications. Motivated by this observation, we next examine in detail the major factors that determine an application's performance with SM scaling.

(1) NoC Effect on SM Scaling. GPU SMs are connected to the L2 cache and memory controllers through a network-on-chip (NoC). It has been shown that the NoC is a bottleneck in GPU performance as the chip size grows [11]. This is due to the particular traffic pattern exhibited by GPUs. Specifically, all SMs communicate with the limited number of memory controllers on the chip. As the total on-chip bandwidth is fixed and shared by all SMs, more SMs means that each SM receives a smaller share of the network bandwidth. In addition, a larger network incurs longer delays due to increased hop count and contention. As a result, there is more negative impact on performance. We experimented with different SM scaling options using a perfect NoC (with zero delay), and the results are plotted in Figure 3(b). We can observe that when the NoC impact is removed, more applications (e.g., LPS, AES, CP, and SC) achieve better performance with scale-out settings. This means that, for applications that are sensitive to the on-chip network performance, performance will ultimately degrade when we keep scaling out the SMs.

(2) Memory Locality and SM Scaling. It has been observed that memory resources inside an SM affect the performance of some applications. Some applications may share data extensively among warps in one SM or among L1 caches in different SMs. In such cases, scaling up will improve the utilization of shared memory and L1 cache, and thus reduce accesses to memory outside of an SM. GPUs employ a mechanism called memory coalescing to reduce data movement. The idea is to combine multiple memory accesses from a warp to the same cache line into a single transaction. Larger SMs can execute larger warps, and thus provide more opportunities for memory coalescing. We quantitatively characterize the coalescing effects in SMs with different scaling settings, as shown in Figure 4. In this figure, the y-axis shows the actual memory accesses as a percentage of all load and store instructions after coalescing. As shown in Figure 4, a scale-up design with 16 SMs generates far fewer memory accesses than a scale-out design with 64 SMs. That is, as far as coalescing is concerned, scale-up SMs bring more benefits than scale-out SMs.

In addition, recent GPU architectures combine data cache and shared memory functionality into a single memory block to provide the best overall performance. This makes the actual L1 cache capacity several times larger when needed. For example, NVIDIA's Volta [29] architecture has a combined capacity of 128 KB/SM, more than seven times larger than the GP100 data cache, and all of it is usable as a cache by programs that do not use shared memory. Considering this trend, we also investigated L1 data sharing among neighboring SMs with increased L1 capacity, and the results are plotted in Figure 5. As can be observed, some benchmarks (e.g., HW) exhibit around a 10% sharing rate in the baseline configuration. When the L1 capacity is increased by two or four times, a higher sharing rate is observed in most benchmarks that exhibit data sharing. This means that scaling up SMs by increasing the L1 capacity can effectively reduce duplicated data, leading to more efficient utilization of the L1 caches.

Figure 5: Rate of shared data in L1 caches of neighboring SMs.

Figure 6: Control divergence caused stalls with different GPU scaling options.

(3) Control Divergence and SM Scaling. Recent GPU architectures allow individual threads to follow distinct program paths with control flow on the SIMD pipeline. Control divergence occurs when threads in the same warp take different paths upon a conditional branch, which can lead to significant performance degradation because it increases pipeline stalls [30, 31]. Even though various software techniques have been proposed to better schedule branch instructions [19, 20, 21], control divergence cannot be totally removed. We measured the core inactivity caused by control divergence, as shown in Figure 6. It can be seen from this plot that, for scale-up SMs, pipeline stalls caused by branch instructions are much more frequent than for scale-out SMs. In fact, for many benchmarks, the cores are stalled for more than half of the time waiting for branch instructions to be resolved. This is because, in larger SMs, the pipeline is wider than in smaller SMs; as a result, a pipeline stall causes a larger reduction in computation parallelism. In this sense, applications with many control instructions need to employ scale-out SMs for better performance.
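To make the hazard concrete, the hypothetical CUDA kernel below forces threads of the same warp onto different paths; the kernel and its data layout are our illustration, not a benchmark from the paper.

```cuda
// Hypothetical kernel illustrating intra-warp control divergence.
// Even tids take the 'if' path and odd tids take the 'else' path, so
// every 32-thread warp splits: the SIMD pipeline executes each side
// with half of its lanes masked off. On a wider (scaled-up) SM, each
// such stall idles twice as many lanes.
__global__ void divergent(float *out, const float *in, int n) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid >= n) return;
    if (tid % 2 == 0)
        out[tid] = in[tid] * in[tid];   // even lanes active here
    else
        out[tid] = in[tid] + 1.0f;      // odd lanes active here
}
```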
Ideally, threads running on GPUs execute in a lock-step fashion and consume continuous computation enabled by warp scheduling to avoid pipeline stalls. However, it has been shown that, for some applications, control-flow divergence and memory divergence inside warps can significantly degrade performance by causing stalls in SM pipelines [30, 31, 32]. Memory divergence occurs when threads from a single warp experience different memory-reference latencies caused by cache misses or by accessing different DRAM banks. In current organizations, the entire warp must wait until the last thread has its reference satisfied. To solve this problem, several techniques have been proposed to divide a warp into smaller slices and regroup them to create a new warp, so that divergent threads do not prevent other threads from proceeding in execution [23, 16, 33]. However, to our knowledge, all existing work subdivides a warp and reorganizes the threads to build a new warp that runs on the "same-sized" SM.

There is a significant drawback in the above-mentioned techniques when implementing variable warp sizes: the SM needs to be subdivided to support the execution of a gang of split warps. For example, gang-warp [16] needs to divide an SM into four slices, and each slice works, after splitting, as a small SM. There are prohibitive hardware overhead and design complexity issues in this type of approach. In addition, prior work only considers resource utilization inside an SM, not across SMs. In contrast, we consider sharing among SMs at a larger granularity. We also consider resources such as the NoC, L1 sharing, and coalescing among SMs, which have not been explored by prior work.

In our proposed approach, we first observe the application's scalability with SM resources such as network and memory. If we detect that the application works better with scale-up cores, we fuse two small SMs into one big SM. However, such a scheme fuses all SMs statically and may not flexibly adapt to a program's dynamic divergence. For example, control and memory divergence between the threads inside a warp may cause long stalls in the fused SM, since the pipeline is much wider. Based on this observation, we propose to dynamically split the scaled-up SM into two smaller SMs to handle the control and memory divergence within a warp. Once we detect that the divergence no longer exists, we fuse the two SMs back into one.

Since a fused SM already consists of two sets of execution paths, no extra hardware is needed to support slicing, as opposed to the prior work [23]. It needs to be noted that we dynamically split and fuse SMs independently in this scheme. Fusing and splitting decisions are made based on the current warps' running status, locally on each SM. As a result, when using our approach, at any given time during execution, the GPU architecture can have two types of SMs: some (fused) big SMs and some (split) small SMs. Through this type of dynamic heterogeneity, we are able to further improve resource utilization and achieve better performance over the state of the art.
Usually, reconfigurable architectures involve redesigning micro-architecture units, and this may lead to significant overhead if not handled carefully. For this reason, not many reconfigurable CPU architectures have been proposed in the past. In the case of GPUs, however, the reconfiguration overhead can be much lower. This is because GPU SMs have a much simpler structure and control logic compared to general out-of-order CPU cores. Specifically, a GPU has a very simple in-order pipeline, which reduces the reconfiguration complexity. In addition, GPUs are designed to hide memory latency by overlapping the execution of a large number of threads. As a result, delays caused by reconfiguration can be conveniently masked. This makes GPUs excellent candidates for reconfigurable architectures. Reconfiguration overhead also heavily depends on the granularity at which reconfiguration takes place. In this work, we propose a coarse-grained reconfigurable architecture based on SMs, which further reduces design complexity and overhead. Specifically, we only reconfigure GPUs at the SM level, without modifying pipeline structures; we only modify a few managed resources such as warp queues, L1 cache, and register files. Therefore, the proposed GPU architecture is very amenable to reconfiguration.
The goal of our design is to reduce resource under-utilization and also improve performance. To reduce design complexity, we opt for coarse-grain reconfiguration. Since it has been shown that individual kernels exhibit regular behavior, we propose a one-time reconfiguration scheme on a kernel-by-kernel basis. Once a kernel is determined to benefit from scale-up SMs, we fuse every two neighboring SMs to create scale-up SMs. Otherwise, we continue executing the kernel using scale-out SMs. Our method is basically a top-down approach: we first characterize the kernel's overall scaling behavior with respect to overall GPU resource utilization, and then make a decision regarding whether to fuse or not. Based on this static fusion scheme, we also propose to refine the mechanism by allowing individual fused SMs to split dynamically if warps exhibit significant divergence in the fused SM.

Figure 7: Reconfiguration controller overview.
A high-level view of our reconfiguration controller is shown in Figure 7. Profiling has been employed by many resource utilization techniques to determine an application's characteristics [16, 24, 9]. In this work, we propose to combine online profiling with an offline-trained model to predict scalability. When a new kernel starts, we first evaluate various metrics regarding its execution. Then, these metrics are fed into a scalability predictor that has been trained offline. The scalability predictor indicates whether the kernel should be executed on scale-up or scale-out SMs. Next, we reconfigure the SMs according to this result and start executing the kernel. After the kernel finishes, we start the loop again for the next kernel.
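A minimal sketch of this per-kernel control loop is given below; all structure and function names are ours, standing in for hardware mechanisms, and are not part of AMOEBA's actual implementation.

```cpp
#include <cstdio>

// Hypothetical host-side sketch of AMOEBA's per-kernel control loop.
struct Metrics {
    double noc_throughput, noc_latency, coalescing_rate,
           l1_miss_rate, mshr_rate, inactive_thread_rate;
};

// Placeholder stubs standing in for the hardware mechanisms.
Metrics profile_one_cta(int kernel_id) { return Metrics{}; }
bool    predict_scale_up(const Metrics &m) { return false; }
void    configure_sms(bool fused) { printf("fused=%d\n", fused); }
void    run_kernel(int kernel_id) {}

void dispatch(const int *kernels, int num_kernels) {
    for (int k = 0; k < num_kernels; ++k) {
        Metrics m = profile_one_cta(kernels[k]); // 1. online profiling
        bool fuse = predict_scale_up(m);         // 2. scalability predictor
        configure_sms(fuse);                     // 3. one-time reconfiguration
        run_kernel(kernels[k]);                  // 4. execute kernel to completion
    }
}
```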
It has been shown that kernels exhibit disparate behavior with SM scalability and resource utilization [12, 34]. Therefore, we cannot profile one kernel to predict the behavior of an entire application. Recall, however, that each kernel is split into smaller blocks, called CTAs, that execute similar portions of the code. We found that CTAs exhibit very consistent behavior, which closely follows the scalability trend at the kernel granularity. Figure 8 shows how CTAs follow the same scaling trend as their kernel, using the applications LIB and RAY. As can be observed, both the kernel and the CTAs of RAY show a scale-up trend, whereas the LIB kernel and its CTAs exhibit a scale-out trend. Therefore, we propose to use a CTA to predict the scaling behavior of a kernel.

Figure 8: Kernel and CTA scalability consistency.
To profile an application's scalability with respect to the SM size and number, we need to identify metrics that can influence the scalability. The following are the major metrics we considered in this work:

(1) NoC throughput: This metric reflects the application's "communication intensity". If the NoC is a bottleneck, choosing scale-up cores will improve performance, because the SM count would be smaller and the network size accordingly smaller, resulting in each core having a higher network throughput.

(2) Average NoC latency: This is the average latency of the packets. It can also be used to evaluate communication intensity.

(3) Coalescing rate: The coalescing rate is calculated as the number of actual memory accesses sent out from each SM divided by the total number of memory accesses in the instructions. This metric reflects how much shared data is requested across warps in an SM.

(4) L1 cache miss rate: This reflects an application's demand for local memory. If the miss rate is high and the data is not streaming, allocating a larger L1 will improve the performance, which means scale-up SMs are expected to perform better.

(5) MSHR rate: This metric is similar to the coalescing rate, but it is measured across different instructions. Scale-up SMs will have more instructions in flight, and this will benefit applications with higher MSHR rates.

(6) Inactive thread rate: This is used to reflect warp control divergence. It is calculated as the number of cycles threads spend idling due to control instructions, divided by the total execution cycles. Kernels with larger control divergence favor scale-out SMs.
In this work, we propose to use binary logistic regression, a machine learning technique borrowed from the field of statistics, to predict scalability. Our model accepts several input parameters and generates a binary output indicating whether an application should be run with scale-up or scale-out SMs. Since we only fuse two neighboring SMs to build a scale-up core, we only need a simple regression-based model to predict scalability. The output of the model is binomial: yes or no to scale up.

Binary logistic regression estimates the probability that a characteristic is present (e.g., the probability of "success"), given the values of explanatory variables. Unlike the normal distribution, the mean and variance of the binomial distribution are not independent. Specifically, the mean is denoted by $P$ and the variance by $P(1-P)/n$, where $n$ is the number of observations, and $P$ is the probability of the event occurring (i.e., whether we need to reconfigure smaller SMs into bigger SMs) in any one trial. If we were considering the data in a list rather than in table form, we would assume that the variable has mean $P$ and variance $P(1-P)$, i.e., a Bernoulli distribution. When we have a proportion as a response, we use a logistic or logit transformation to link the dependent variable to the set of explanatory variables. The logit link has the form:

$$\mathrm{Logit}(P) = \log\left[\frac{P}{1-P}\right]. \quad (1)$$

The term within the square brackets is the odds of an event occurring; in our case, it indicates whether we need to configure bigger cores. The logit scale stretches a proportion to the range from minus to plus infinity, with $\mathrm{Logit}(P) = 0$ when $P = 0.5$. When we transform our results back from the logit (log-odds) scale to the original probability scale, the predicted values are always at least 0 and at most 1. If there is only one input $x$, we can write the model as:

$$P = \frac{e^{b_0 + b_1 x}}{1 + e^{b_0 + b_1 x}}, \quad (2)$$

where $b_0$ is the bias or intercept term and $b_1$ is the coefficient for the single input value $x$. We can write the model in terms of odds as:

$$\frac{P}{1-P} = e^{b_0 + b_1 x}. \quad (3)$$

Conversely, the probability of the outcome not occurring is

$$1 - P = \frac{1}{1 + e^{b_0 + b_1 x}}. \quad (4)$$

For an event with multiple input factors, the modeled logarithm of the odds is given by:

$$\log\left(\frac{P}{1-P}\right) = b_0 + b_1 x_1 + b_2 x_2 + \dots + b_n x_n, \quad (5)$$

where $P$ indicates the probability of the event (in our case, the chance to scale up by fusing SMs), and the $b_i$ are the regression coefficients associated with the explanatory variables $x_i$. We train this binary logistic model offline using a large amount of experimental data to obtain the values of $b_0$ through $b_n$. We then use this model to directly infer the fusing decision online. Since the model is in fact linear, its implementation overhead is quite low. We give more details of the overheads in later sections.
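Since inference only needs the sign of the log-odds, the online predictor reduces to a dot product followed by a threshold test; the sketch below is ours, with a placeholder feature count and coefficient layout.

```cpp
#include <cmath>

// Minimal sketch of inference with the trained logistic model: compute
// the log-odds b0 + sum(bi * xi); the probability of "fuse" is the
// logistic function of this value, so P > 0.5 exactly when the
// log-odds is positive. NUM_FEATURES and the layout are placeholders.
const int NUM_FEATURES = 6;

bool predict_scale_up(const double x[NUM_FEATURES],
                      const double b[NUM_FEATURES + 1]) {
    double log_odds = b[0];                       // intercept b0
    for (int i = 0; i < NUM_FEATURES; ++i)
        log_odds += b[i + 1] * x[i];              // bi * xi
    double p = 1.0 / (1.0 + std::exp(-log_odds)); // P(scale up), Eq. (2)
    return p > 0.5;                               // equivalently: log_odds > 0
}
```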
The goal of AMOEBA is to create a GPU architecture that can dynamically change the number and size of its SMs based on run-time workload behavior. We propose to start with a "baseline" scale-out machine and fuse neighboring SMs into a bigger SM if the application is found to perform better with scaling up. Note that we allow fusing only two neighboring SMs, due to the following considerations: (1) Our scale-out SM has 32 SIMD units, and a scaled-up SM has 64 SIMD units when two SMs are fused. Fusing more SMs would significantly increase the pipeline width and the probability of pipeline stalls. In the future, if the scale-out SM gets even smaller, for example with 16 SIMD units, then fusing 4 such SMs together would become a more viable option; note that our techniques can easily be extended to fuse more SMs to scale up. (2) Because fused SMs share resources such as the L1 cache, register files, and warp schedulers, fusing more SMs means increased communication latency and implementation complexity. For example, a larger L1 cache needs a longer access time, which would compromise the potential benefit of SM fusion. For these reasons, we only consider fusing two neighboring SMs in this paper.

Figure 9: SM reconfiguration via fusion.
Figure 9 shows how two scale-out SMs are fused to create a scale-up SM. The dashed lines show the fused units of the two SMs, placed to ensure that they can work in a lockstep fashion as one SM. The shaded components in SM1 are disabled due to SM fusion. In the fused SM, instructions are first fetched from the fused L1 I-cache. Then, the instructions are decoded, and selected instructions are sent to the per-warp I-buffers. Next, the control logic decides which instruction to issue, and the decision is sent to the issue unit. Selected warps are then sent to the datapaths of both SM0 and SM1 for execution. Memory accesses are sent from the executing threads to the fused memory unit.

In Figure 9, there are two baseline SMs, shown as SM0 and SM1. AMOEBA does not change the execution units such as the SPs or SFUs. When fused, the register files and scoreboards of the two original SMs work independently, as in the baseline. AMOEBA does not change the register files: since register files are allocated with warps, they are not fused but can be accessed independently. Thus, there is no change in the throughput of any individual register file. Similarly, the scoreboard connection with each register file is not modified either. However, the connection of the scoreboard in SM1 to its warp scheduler is removed when the two SMs are fused; instead, this scoreboard is connected to the warp scheduler of SM0. This is because when we fuse two SMs, only one warp scheduler is kept, and it schedules all warps on both SMs.

The memory components of the two SMs need to be fused, and this includes the shared memories, L1 I-caches, L1 D-caches, and L1 constant caches. We fuse L1 caches by increasing the cache associativity. To reduce the new L1 cache access latency, the SM layout is modified as shown in Figure 9, so that the L1 caches of both SMs are placed next to each other. Since GPUs are good at hiding memory access latencies through overlapped warp execution, the extra delay caused by accessing a larger L1 D-cache can be hidden by warp computation. In our experiments, we conservatively added one extra cycle to the L1 cache access latency due to the cache fusion. Our results show that this extra delay is hidden quite well by the overlapped computation.

Each fused SM has one copy of the coalescing unit, created by fusing the two coalescing units of both SMs. Since the warp size is doubled, this leads to more opportunities for coalesced memory accesses. After fusing two SMs, AMOEBA combines the NoC routers of the two SMs into one by disabling one SM's router; this is implemented by adding a bypass path in the disabled router. As a result, the network size is reduced, which significantly reduces the network latency, and consequently each router can enjoy a higher throughput in the network.
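As a rough illustration of fusing L1 caches by increasing associativity, the sketch below doubles the way count while keeping the set count, so the address-to-set mapping is unchanged; the 4-way and 128-byte-line parameters are our assumptions, not values from the paper.

```cpp
// Illustrative L1-fusion arithmetic: two 16 KB caches keep their set
// count but contribute their ways to one 32 KB cache of doubled
// associativity, so set indexing is unchanged after fusion.
struct CacheCfg {
    int size_bytes, ways, line_bytes;
    int sets() const { return size_bytes / (ways * line_bytes); }
};

CacheCfg fuse(const CacheCfg &a) {
    // Double capacity and associativity; sets() stays the same.
    return { a.size_bytes * 2, a.ways * 2, a.line_bytes };
}

// Example: CacheCfg l1{16 * 1024, 4, 128};  // 32 sets (assumed params)
//          CacheCfg big = fuse(l1);         // 32 KB, 8-way, still 32 sets
```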
We propose to fuse SMs to reconfigure the GPU as a scale-up architecture when we observe that fusing the resources of two SMs is beneficial from a performance angle. It needs to be noted that our approach is different from prior works such as variable warps [30] or warp subdivision [33]. Those works only consider the resources inside an SM and try to fully utilize them; there is no cross-SM resource utilization optimization in those prior studies. Our proposed architecture, on the other hand, takes into account cross-SM resource utilization, such as NoC resources, sharing L1 caches between SMs, and memory access coalescing across SMs. As a result, it is fundamentally different from the earlier works.

However, there are still opportunities to further improve resource utilization in AMOEBA. That is, when we fuse two SMs, there can be scenarios where warp heterogeneity causes inefficient pipeline utilization. For example, even though fusing two SMs can bring benefits in cache access or NoC performance, the resulting larger warp size creates wider pipelines. In this case, divergence in the memory or control behavior of warps can lead to more pipeline stalls compared to the unfused SMs. Therefore, we propose a dynamic SM splitting strategy: when we observe significant warp divergence, and the wide pipeline leads to a performance degradation that outweighs the benefits of fusion, we split the fused SM into two separate SMs. In this way, each split SM has half the pipeline width, and the warps that cause divergence can only cause stalls in one of the smaller SMs. The other SM can keep computing without being delayed by the pipeline stalls.

We can have different policies to decide when to split a given "fused" SM into two independent ones. Note that, by "independent", we mean that the two SMs run different warps independently on their respective datapaths. However, to reduce the cost of hardware and context switching, we do not split the shared resources, such as the L1 cache, register files, and NoC interface. We set up a threshold to decide when to split, defined as a fixed ratio of divergent warps to the total warps running in the large SM. If the current ratio is greater than the threshold, we split the SM into two. This figure also shows how NoC interfaces are bypassed when two SMs are fused together.

After the SM splits, we move all divergent warps from the bin to the new SM created by the split. Subsequently, the two SMs start independent execution of their warps. When the second SM finishes all divergent warps, we re-fuse the two SMs into one. Then, we start the procedure of collecting divergent warps again, and split the SMs when necessary. Thus, this procedure of splitting and fusing is decided dynamically by the divergence of warps. This mechanism is expected to maximize resource utilization and reduce stalls in the fused SMs.

The idea behind splitting is to prevent divergent warps from causing pipeline stalls. So, we need to separate divergent warps and non-divergent warps into two clusters and run each cluster on a separate smaller SM, so that the slow warps do not delay the fast warps. Suppose that we have split a scale-up SM into two scale-out SMs (SM 0 and SM 1), and we want to run fast warps on SM 0 and slow warps on SM 1. There can be different mechanisms to decide which warps to move to SM 1. In this work, we investigated two methods: (1) direct split and (2) warp regrouping.
The direct split method is simple: it directly divides a divergent warp in the middle into two smaller warps, and both smaller warps are moved to SM 1. This method has a low cost but may not yield optimal performance, because the slow threads in a divergent warp may be located at different positions. If we simply cut the warp in half, there can be varying combinations of resulting warps: we may get one warp with all fast threads and one warp with all slow threads, or we may get two smaller warps that each contain some slow threads. The ideal case is the former, since it better isolates the negative effects of the slow threads from the fast ones.

Based on this analysis, we propose a second method that regroups threads into a fast warp and a slow warp. We then move the slow warp to SM 1 and keep the fast warp in SM 0. To accomplish this, we first divide the threads in the original warp into small groups, and label them as "fast" or "slow" based on divergence. Then, we regroup them into two warps so that the slowest groups are all put into a slow warp and moved to SM 1. In our design, we also periodically check the stalls in the slow SMs, and periodically move some fast warps to them, so that resources are not wasted when the slow warps cause stalls.

The hardware overhead of splitting is low because the split SMs were two independent SMs in the baseline architecture anyway. We added hardware to fuse them as described earlier, and splitting them needs no extra hardware, except for the management and storage of the divergent warps. Therefore, we only need a new warp queue and some simple control logic. Compared to the prior works [30, 33] that proposed splitting resources inside one SM, our overhead is very low. Figure 10 and Figure 11 show the timing and the algorithm of our dynamic splitting and fusing.

Figure 10: Mechanism for switching between fusing and splitting.

Figure 11: Algorithm to dynamically split a fused SM to accommodate warp heterogeneity.
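The split policy and the warp-regrouping step can be summarized as in the hypothetical sketch below; the threshold value, the Warp/Thread types, and the helper names are ours, not AMOEBA's actual implementation.

```cpp
#include <vector>
#include <algorithm>

// Hypothetical sketch of the split decision and of warp regrouping.
const double SPLIT_THRESHOLD = 0.25;  // divergent-to-total warp ratio (assumed value)

struct Thread { int id; bool slow; };       // 'slow' = divergent/long-latency
struct Warp   { std::vector<Thread> threads; bool divergent; };

// Split the fused SM once the fraction of divergent warps exceeds
// the fixed threshold described in the text.
bool should_split(const std::vector<Warp> &warps) {
    int divergent = 0;
    for (const Warp &w : warps) divergent += w.divergent;
    return (double)divergent / warps.size() > SPLIT_THRESHOLD;
}

// Warp regrouping: reorder the threads of a divergent warp so the
// slow threads gather in one half; the slow half becomes a new warp
// moved to the split-off SM 1, while the fast half stays on SM 0.
void regroup(Warp &w, Warp &fast_out, Warp &slow_out) {
    std::stable_sort(w.threads.begin(), w.threads.end(),
                     [](const Thread &a, const Thread &b) {
                         return !a.slow && b.slow;   // fast threads first
                     });
    size_t half = w.threads.size() / 2;               // halving is a simplification
    fast_out.threads.assign(w.threads.begin(), w.threads.begin() + half);
    slow_out.threads.assign(w.threads.begin() + half, w.threads.end());
}
```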
Table 1: System configuration. See GPGPU-Sim v3.2.2 [35] for the full list.
Number of Computing Cores: 48
Number of Memory Controllers: 8
MSHRs per Core: 64
Warp Size: 32
SIMD Pipeline Width: 8
Number of Threads per Core: 1024
Number of CTAs per Core: 8
Constant Cache Size per Core: 8 KB
Texture Cache Size per Core: 8 KB
L1 Cache Size per Core: 16 KB
L2 Cache Size per Core: 128 KB
Number of Registers per Core: 16384
Warp Scheduler: Greedy-Then-Oldest
Shared Memory: 48 KB
Memory Scheduler: FR-FCFS
Memory Model: 8 MCs, 924 MHz
NoC Channel Width: 128 bits
NoC Topology: mesh
NoC Router Pipeline Stages: 2

Figure 13: Control divergence caused stalls.
We simulate our baseline architecture using a cycle-level simulator (GPGPU-Sim [36]) and faithfully model all key parameters (Table 1). The baseline GPU consists of 48 scale-out SMs with a warp size of 32. There are 8 memory controllers on the chip. The interconnection network is a mesh-based NoC with two subnets to avoid deadlock between request and reply messages. The router has a 2-stage pipeline. When we perform reconfiguration, two baseline SMs are fused to create one scale-up SM. We use a wide range of GPU applications from ISPASS [37], Rodinia [38], Polybench [39], and Mars [40] to evaluate our design, and we execute all applications to completion. We report performance results using the geometric mean of IPC speedup (over the baseline GPU). We also report other evaluation metrics provided by the simulator, such as L1 cache miss rate, NoC latency, network injection rate, and SM idle rate.
Figure 12: Performance results.

Figure 12 illustrates the performance gains obtained with AMOEBA. The baseline is a scale-out architecture, and we also experiment with direct scale-up. We present the performance of the three techniques proposed by AMOEBA. Static fuse configures the SMs only once, before a kernel's execution: using the prediction model, AMOEBA predicts the scalability of the application with SMs and chooses whether or not to fuse pairs of SMs. The next two techniques are based on dynamic heterogeneous SM scaling. Direct split simply divides a divergent warp in the middle into smaller ones, whereas warp regrouping employs the more sophisticated technique of reorganizing threads into a fast warp and a slow warp. As can be observed, SM achieves the highest improvement in performance, by 4.25 times. MUM also achieves a significant performance improvement of 2.11 times. On average, the 12 benchmarks see around a 47% increase in IPC.

For applications that can benefit from larger SMs, static fuse achieves almost the same performance gain as direct scale-up. However, some benchmarks prefer scale-out configurations, such as ATAX. For these workloads, our fusing techniques all perform better than direct scale-up (by about 10%). This shows that AMOEBA can accurately predict an application's scalability, and that the correct reconfiguration leads to performance gains. Some workloads, such as FWT and KM, are not sensitive to scaling, and all AMOEBA techniques perform similarly to the baseline. In general, direct split and static fuse bring similar benefits (on average) for most workloads, except BFS and SM. Some workloads, such as WP, even experience performance degradation, which is mainly due to the fusion overhead: the static technique cannot dynamically react to changes in workload behavior. In contrast, warp regrouping achieves a 16% performance gain over direct split because it can accurately capture a workload's dynamic, divergence-induced behavior.
Figure 13 plots the SM inactive rate caused by control divergence, which is defined as the fraction of cycles that SMs are stalled due to control instructions. We can observe that only some of the workloads suffer from stalls caused by control divergence. For the workloads that do have control-divergence-caused stalls, dynamic fusion performs better than direct scale-up and static fusing because it can dynamically adjust to changes in control divergence. Warp regrouping performs better than direct split in more cases because fast and slow warps are allocated to different SMs. Among all configurations, the baseline scale-out configuration has the least amount of stalls because its pipeline width is always smaller than that of the other configurations.
Figure 14: L1-I cache miss rate.

Figure 15: L1-D cache miss rate.

The L1-I cache miss rate is plotted in Figure 14. Some benchmarks, such as FWT and ATAX, are not sensitive to the L1-I cache capacity, and fusing does not change their behavior. However, most benchmarks have their miss rates reduced; the average reduction is 9%, 20%, and 30% for the three AMOEBA schemes, respectively. Sharing the L1-I cache through SM fusion reduces I-cache misses and thus leads to improved performance. Figure 15 plots the miss rate of the L1-D cache. The most significant reduction is for SM, whose miss rate is reduced by more than 70%. This is because the sharing of the L1 cache increases its effective capacity, and this change directly leads to the 4.25x improvement in performance. Some benchmarks, such as BFS and MUM, experience increased L1-D cache miss rates. This is because warp regrouping changes data locality by moving warps between SMs, and this leads to higher miss rates.

The impact of AMOEBA on memory accesses is plotted in Figure 16. As can be observed, all benchmarks achieve reduced actual memory access rates compared to the baseline. The actual memory access rate is calculated as the actual memory access count divided by the total number of memory accesses in the instructions. Since AMOEBA allows SMs to share coalescing units, the actual number of loads and stores is greatly reduced.

Figure 16: Actual memory accesses.

Figure 17: Normalized rate of stalls when MCs cannot inject into the NoC.
Figure 17 plots the normalized ICNT stall rate, which is defined as the rate of stalls when new reply packets cannot be generated because an MC's injection queues are full. This data reflects the pressure on both the NoC and the memory controllers. As can be observed from this figure, all AMOEBA schemes are able to reduce this stall rate. For some benchmarks, such as CORR and COVR, this stall time is removed entirely. Since AMOEBA can fuse SMs and bypass some routers, the network size is reduced, and this leads to smaller hop counts. As a result, the NoC bottleneck can be greatly relieved for communication-intensive applications. Figure 18 shows the average network data injection rates for the SM configurations evaluated. As can be observed from this plot, all benchmarks have a higher injection rate under AMOEBA than under the baseline. This is because we fuse SMs and use only one NoC network interface to inject packets. Even though the injection rate is higher under the AMOEBA schemes, the network size is reduced by half, and this leads to shorter communication delays, paving the way to better performance.

Figure 18: NoC injection rate.
Figure 19: Phases of dynamic SM fusion and splitting.

To observe the dynamics of switching between fusing and splitting, we studied the status of five SMs in the benchmark RAY. The results are shown in Figure 19. As shown in this figure, all 5 SMs start with fused execution because this benchmark favors scale-up SMs. After a period of time, the SMs start to split because enough divergent warps have been detected and smaller SMs bring more benefit. However, the switching between fusing and splitting of each SM is independent of the others. As a result, at any given time, both scale-up and scale-out SMs can exist in the architecture, and better performance is achieved from this flexible heterogeneity in SM configurations provided by AMOEBA.
We use several performance counters to generate the detailed metrics required by our scalability prediction model. Most of these performance counters are already included in many of today's GPU systems, including cache hit and miss counters, MSHR statistics, and branch instruction statistics. For metrics that cannot currently be provided by performance counters, such as the number of concurrent CTAs, we propose to add the corresponding counters. Table 2 shows the coefficients in our scalability prediction model.

Table 2: Coefficients in the scalability prediction model.

Constant: -73.635
Concurrent CTA: 1.414
Control Divergent: 444.628
Coalescing: 2057.050
L1D Miss Rate: -313.838
L1I Miss Rate: 1674.513
L1C Miss Rate: -67.277
MSHR: -102.971
Load Inst Rate: -680.786
Store Inst Rate: -804.7
NoC: -8.301

To analyze the relative contribution of each metric in the prediction model, we plot the distributed weights of the major metrics. Here, we consolidate the different types of L1 cache miss rate into one metric called L1 miss rate. The result is shown in Figure 20. For each metric, the magnitude of impact is shown as a value between -1 and 1. The impact magnitude of a metric is calculated as the coefficient of the metric multiplied by its measured value; for example, the impact magnitude of the load instruction metric is calculated as the load instruction rate times its coefficient. All positive impact magnitudes contribute to a scale-up decision, and all negative impact magnitudes contribute to a scale-out decision. Eventually, we add all the metrics' impact magnitudes together and check the sum: if the result is positive, we predict that fusing SMs into a scale-up configuration will fit the application; otherwise, we predict that a scale-out configuration will fit better. In Figure 20, the sums of the impact magnitudes for BFS and RAY are both positive, so these benchmarks favor running on scale-up SMs. On the contrary, CP and PR prefer to run on scale-out SMs. It can also be observed that different applications' scalability is influenced by different metrics to varying extents. For example, MSHR plays a more significant role for BFS and CP, whereas PR and RAY are more sensitive to the NoC performance than the others.
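To make the decision arithmetic concrete, the sketch below plugs a vector of measured metric values into the Table 2 coefficients and sums the per-metric impacts; the coefficients are from the table, but the measured values here are hypothetical.

```cpp
#include <cstdio>

// Worked example of the impact-magnitude computation. Coefficients are
// the trained values from Table 2; the 'measured' vector is hypothetical.
int main() {
    const char *name[] = { "Concurrent CTA", "Control divergent", "Coalescing",
                           "L1D miss rate",  "L1I miss rate",     "L1C miss rate",
                           "MSHR",           "Load inst rate",    "Store inst rate",
                           "NoC" };
    double coeff[]    = { 1.414, 444.628, 2057.050, -313.838, 1674.513,
                          -67.277, -102.971, -680.786, -804.7, -8.301 };
    double measured[] = { 8, 0.02, 0.01, 0.30, 0.002, 0.05,
                          0.10, 0.15, 0.05, 2.0 };   // hypothetical inputs

    double sum = -73.635;                         // constant term from Table 2
    for (int i = 0; i < 10; ++i) {
        double impact = coeff[i] * measured[i];   // per-metric impact magnitude
        printf("%-18s %+10.3f\n", name[i], impact);
        sum += impact;
    }
    // Positive sum -> fuse SMs (scale up); negative -> stay scaled out.
    printf("sum = %+.3f -> %s\n", sum, sum > 0 ? "scale up (fuse)" : "scale out");
    return 0;
}
```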
Figure 20: Magnitude of parameter impact on determining scalability for some applications using the proposed predictor.

We now compare the performance of AMOEBA against Dynamic Warp Subdivision (DWS) [33]. The results are plotted in Figure 21. DWS was proposed by Meng et al. to divide warps into smaller ones in order to reduce the stalls caused by memory and control divergence. On average, AMOEBA achieves a 27% performance gain over DWS. The benchmark SM achieves a 3.97x improvement in performance compared to DWS. This is because DWS can only improve resource utilization inside an SM and cannot harness the benefits of cross-SM resource sharing. In contrast, AMOEBA can dynamically change the configuration of its SMs and thus flexibly allows resources to be shared among SMs. Thus, performance can be further improved through enhanced resource utilization.
There are two types of controllers in the proposed architecture: the online reconfiguration controller for scale-up or scale-out, and the switch controller for dynamic fusing and splitting. We propose to implement these controllers in an IP module on the GPU chip. The major components of the controllers are a MAC unit, buffers, and control logic. We employ methods similar to those proposed in [32] to model the buffers in the controllers, using the area of a latch cell from the NanGate 45 nm Open Cell library. The resulting area of each bit of the buffer is 4.2 µm², and the estimated added buffer area is 0.021 mm² per SM. We use a pipelined Booth-Wallace MAC [41], synthesized with the Synopsys Design Compiler at 90 nm technology and scaled to 45 nm. The area of the MAC is 0.019 mm². Together with the control logic, we estimate the two controllers to have an area of 1.52 mm². For a GeForce 8800 GTX, which has 128 SM cores, the overall area overhead of AMOEBA can be calculated as the total SM area overhead plus the controller overhead: 0.021 mm² × 128 + 1.52 mm² = 4.208 mm². Compared to the total GeForce 8800 GTX area of 480 mm², AMOEBA incurs an area overhead of 0.88%.

There has been plenty of work proposing reconfigurable architectures for multi-core CPU systems [25, 26, 27, 24, 42]. A multicore architecture is proposed in [25] that reconfigures cores into a wide VLIW machine to exploit hybrid forms of parallelism. As a pioneering reconfigurable architecture, TRIPS [26] splits ultra-large cores into small ones to meet the diverse parallelism demands of applications. Working in the opposite direction, Ipek et al. [24] proposed Core Fusion, where a large core can be dynamically configured from a group of independent smaller cores. Core Fusion is the work most closely related to AMOEBA, but it was proposed for CPU cores, and its core fusing policy and microarchitecture are very different from our work.
Figure 21: Comparison with Dynamic Warp Subdivision (DWS) [33].
Compared to CPU-based multicore systems, there have been fewer works on reconfigurable GPU architectures. Voitsechov et al. proposed SGMF, a dataflow architecture using a coarse-grain reconfigurable fabric composed of a grid of interconnected functional units [13]. However, SGMF needs help from the compiler to break CUDA/OpenCL kernels into dataflow graphs, and it integrates the control flow of the original kernel to produce a control-data-flow graph (CDFG). Different from their work, our proposed scheme does not require compiler support. R-GPU is a reconfigurable GPU architecture that aims to reduce the cycles spent on data movement and control instructions and to focus on data computations [14]. It configures GPU cores to create a spatial computing architecture. R-GPU implements reconfiguration at the core level within an SM and does not consider an application's scalability, while our work reconfigures at the SM level, and our reconfiguration decision is based on the NoC, control instructions, and memory access patterns. Dhar et al. proposed fine-grained and coarse-grained reconfiguration of SMs in GPUs in order to reduce the under-utilization of resources and the power consumption [15]. However, their work only reconfigures the datapath inside each SM; our work also reconfigures the memory and NoC of the system, and we further propose heterogeneous SMs to improve performance and power efficiency.

Heterogeneous multicores have emerged as a promising approach for CPU-based systems; they leverage cores with different capabilities and complexities to strike a balance between performance and power [9, 43, 44, 45, 46, 47, 48]. Lukefahr et al. propose composite cores that consist of big and small compute engines [9]. Kumar et al. [44] proposed a heterogeneous multi-core architecture to reduce power dissipation. Hill et al. showed that there is great potential for improving the performance of the serial sections of an application using heterogeneous cores [43]. Our proposed AMOEBA architecture differs from these heterogeneous architectures in that our heterogeneous cores are dynamically configurable, while these earlier works employ fixed core configurations. Our design provides more flexibility in exploring heterogeneous architectures and achieves better resource utilization.

Recently, several approaches have been proposed for improving GPU resource utilization [12, 16, 17, 18, 49, 50]. Wang et al. propose Simultaneous Multikernel (SMK), which exploits the heterogeneity of different kernels [12]. Park et al. proposed GPU Maestro, which performs dynamic resource allocation for efficient utilization of multitasking GPUs [16]. Wang et al. propose application-aware TLP management techniques for a multi-application execution environment in order to make judicious use of shared resources [17]. To improve resource utilization in concurrent kernel execution (CKE), Dai et al. proposed mechanisms to reduce memory stalls [18]. Our proposed work is different from these prior techniques because it reconfigures SMs so that they scale according to the application's dynamic behavior.
In this work, we propose a reconfigurable GPU architecture, called AMOEBA, to explore the design space of GPU scaling. By predicting a given application's scalability with SM size, the proposed architecture is able to dynamically configure scale-up or scale-out SMs in order to achieve high performance and resource utilization. We also propose an optimization strategy to further reconfigure each SM based on the warp divergence observed at run-time, resulting in a heterogeneous architecture in which both scale-up and scale-out SMs co-exist. Our evaluation results using various benchmark programs demonstrate the effectiveness of AMOEBA in reducing GPU resource under-utilization and improving system performance and power efficiency.

REFERENCES
[1] Green500 list.
[2] Top500 list.
[3] Amazon Web Service. https://aws.amazon.com/ec2.
[4] D. Luebke and G. Humphreys, "How GPUs work," in Computer, vol. 40, no. 2, Feb. 2007.
[5] A. Prakash, H. Amrouch, M. Shafique, T. Mitra, and J. Henkel, "Improving mobile gaming performance through cooperative CPU-GPU thermal management," in Proceedings of the 53rd ACM/EDAC/IEEE Design Automation Conference (DAC), 2016.
[6] NVIDIA. Programming Guide, 2014.
[7] V. Narasiman, M. Shebanow, C. J. Lee, R. Miftakhutdinov, O. Mutlu, and Y. N. Patt, "Improving GPU performance via large warps and two-level warp scheduling," in Proceedings of the 44th Annual International Symposium on Microarchitecture, 2011.
[8] J. J. K. Park, Y. Park, and S. Mahlke, "ELF: Maximizing memory level parallelism for GPUs with coordinated warp and fetch scheduling," in Proceedings of SC15, 2015.
[9] A. Lukefahr, S. Padmanabha, R. Das, F. M. Sleiman, R. Dreslinski, T. F. Wenisch, and S. Mahlke, "Composite cores: Pushing heterogeneity into a core," in Proceedings of the 45th Annual International Symposium on Microarchitecture, 2012.
[10] R. Saleh, S. Wilton, S. Mirabbasi, A. Hu, M. Greenstreet, G. Lemieux, P. P. Pande, C. Grecu, and A. Ivanov, "System-on-chip: Reuse and integration," in Proceedings of the IEEE, vol. 94, no. 6, June 2006.
[11] A. Bakhoda, J. Kim, and T. M. Aamodt, "Throughput-effective on-chip networks for manycore accelerators," in Proceedings of the 43rd Annual IEEE/ACM International Symposium on Microarchitecture, 2010.
[12] Z. Wang, J. Yang, R. Melhem, B. Childers, Y. Zhang, and M. Guo, "Simultaneous multikernel GPU: Multi-tasking throughput processors via fine-grained sharing," in Proceedings of the IEEE International Symposium on High Performance Computer Architecture (HPCA), 2016.
[13] D. Voitsechov and Y. Etsion, "Single-graph multiple flows: Energy efficient design alternative for GPGPUs," in Proceedings of the 41st International Symposium on Computer Architecture (ISCA), 2014.
[14] G. V. D. Braak and H. Corporaal, "R-GPU: A reconfigurable GPU architecture," in ACM Transactions on Architecture and Code Optimization, vol. 0, no. 0, article 0, 2015.
[15] A. Dhar, "The case for reconfigurable general purpose GPU computing," Master's thesis, University of Illinois at Urbana-Champaign, 2014.
[16] J. Park, Y. Park, and S. Mahlke, "Dynamic resource management for efficient utilization of multitasking GPUs," in Proceedings of ASPLOS, 2017.
[17] H. Wang, F. Luo, M. Ibrahim, O. Kayiran, and A. Jog, "Efficient and fair multi-programming in GPUs via effective bandwidth management," in Proceedings of the IEEE International Symposium on High Performance Computer Architecture (HPCA), 2018.
[18] H. Dai, Z. Lin, C. Li, C. Zhao, F. Wang, N. Zheng, and H. Zhou, "Accelerate GPU concurrent kernel execution by mitigating memory pipeline stalls," in Proceedings of the IEEE International Symposium on High Performance Computer Architecture (HPCA), 2018.
[19] C. Basaran and K. D. Kang, "Supporting preemptive task executions and memory copies in GPGPUs," in Proceedings of the Euromicro Conference on Real-Time Systems, 2012.
[20] S. Kato, K. Lakshmanan, R. Rajkumar, and Y. Ishikawa, "TimeGraph: GPU scheduling for real-time multi-tasking environments," in Proceedings of the 2011 USENIX Annual Technical Conference, 2011.
[21] J. T. Adriaens, K. Compton, N. S. Kim, and M. J. Schulte, "The case for GPGPU spatial multitasking," in Proceedings of the 18th HPCA, 2012.
[22] C. J. Rossbach, J. Currey, M. Silberstein, B. Ray, and E. Witchel, "PTask: Operating system abstractions to manage GPUs as compute devices," in Proceedings of the 23rd ACM Symposium on Operating Systems Principles, 2011.
[23] Q. Xu, H. Jeon, K. Kim, W. W. Ro, and M. Annavaram, "Warped-slicer: Efficient intra-SM slicing through dynamic resource partitioning for GPU multiprogramming," in Proceedings of the 43rd Annual International Symposium on Computer Architecture, 2016.
[24] E. Ipek, M. Kirman, N. Kirman, and J. F. Martinez, "Core Fusion: Accommodating software diversity in chip multiprocessors," in Proceedings of the International Symposium on Computer Architecture (ISCA), 2007.
[25] S. A. Lieberman and S. A. Mahlke, "Extending multicore architectures to exploit hybrid parallelism in single-thread applications," in Proceedings of HPCA, 2007.
[26] K. Sankaralingam, R. Nagarajan, H. Liu, C. Kim, J. Huh, D. Burger, S. W. Keckler, and C. R. Moore, "Exploiting ILP, TLP and DLP with the polymorphous TRIPS architecture," in Proceedings of the International Symposium on Computer Architecture, 2003.
[27] K. Mai, T. Paaske, N. Jayasena, R. Ho, W. J. Dally, and M. Horowitz, "Smart Memories: A modular reconfigurable architecture," in Proceedings of the International Symposium on Computer Architecture.
[29] NVIDIA Volta architecture whitepaper. https://images.nvidia.com/content/volta-architecture/pdf/volta-architecture-whitepaper.pdf.
[30] W. W. L. Fung, I. Sham, G. Yuan, and T. M. Aamodt, "Dynamic warp formation and scheduling for efficient GPU control flow," in Proceedings of the 40th International Symposium on Microarchitecture, 2007.
[31] T. D. Han and T. S. Abdelrahman, "Reducing branch divergence in GPU programs," in Proceedings of the GPGPU-4 Workshop, 2011.
[32] T. Rogers, D. R. Johnson, M. O'Connor, and S. W. Keckler, "A variable warp size architecture," in Proceedings of ISCA, 2015.
[33] J. Meng, D. Tarjan, and K. Skadron, "Dynamic warp subdivision for integrated branch and memory divergence tolerance," in Proceedings of ISCA, 2010.
[34] A. Jadidi, M. Arjomand, M. Kandemir, and C. Das, "Optimizing energy consumption in GPUs through feedback-driven CTA scheduling," in Proceedings of SpringSim (HPC), 2017.
[35] GPGPU-Sim v3.2.2 (2016) GTX 480 Configuration. https://github.com/chenxuhao/gpgpu-sim-ndp/tree/master/configs/GTX480.
[36] A. Bakhoda, G. L. Yuan, W. W. L. Fung, H. Wong, and T. M. Aamodt, "Analyzing CUDA workloads using a detailed GPU simulator," in IEEE International Symposium on Performance Analysis of Systems and Software, 2009.
[37] A. Bakhoda, G. L. Yuan, W. W. L. Fung, H. Wong, and T. M. Aamodt, "Analyzing CUDA workloads using a detailed GPU simulator," in IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), 2009.
[38] S. Che, M. Boyer, J. Meng, D. Tarjan, J. W. Sheaffer, S. Lee, and K. Skadron, "Rodinia: A benchmark suite for heterogeneous computing," in IEEE International Symposium on Workload Characterization (IISWC), 2009.
[39] S. Grauer-Gray, L. Xu, R. Searles, S. Ayalasomayajula, and J. Cavazos, "Auto-tuning a high-level language targeted to GPU codes," in Innovative Parallel Computing (InPar), 2012.
[40] B. He, W. Fang, Q. Luo, N. K. Govindaraju, and T. Wang, "Mars: A MapReduce framework on graphics processors," in International Conference on Parallel Architectures and Compilation Techniques (PACT), 2008.
[41] N. Kumar, M. Bansal, and A. Kaur, "Speed, power and area efficient VLSI architectures of multiplier and accumulator," in International Journal of Scientific and Engineering Research, vol. 4, issue 1, January 2013.
[42] C. Kim, S. Sethumadhavan, M. S. Govindan, N. Ranganathan, D. Gulati, D. Burger, and S. W. Keckler, "Composable lightweight processors," in Proceedings of the International Symposium on Microarchitecture, 2007.
[43] M. Hill and M. Marty, "Amdahl's law in the multicore era," in IEEE Computer, 41(7), 2008.
[44] R. Kumar, K. I. Farkas, N. P. Jouppi, P. Ranganathan, and D. M. Tullsen, "Single-ISA heterogeneous multi-core architectures: The potential for processor power reduction," in Proceedings of the International Symposium on Microarchitecture, 2003.
[45] P. Greenhalgh, "big.LITTLE processing with ARM Cortex-A15 & Cortex-A7," 2011.
[46] M. Annavaram, E. Grochowski, and J. Shen, "Mitigating Amdahl's law through EPI throttling," in Proceedings of the 32nd International Symposium on Computer Architecture, 2005.
[47] R. Balakrishnan, R. Rajwar, M. Upton, and K. Lai, "The impact of performance asymmetry in emerging multicore architectures," in Proceedings of the 32nd International Symposium on Computer Architecture, 2005.
[48] M. A. Suleman, O. Mutlu, M. K. Qureshi, and Y. N. Patt, "Accelerating critical section execution with asymmetric multi-core architectures," in Proceedings of ASPLOS, 2009.
[49] Y. Oh, G. Koo, M. Annavaram, and W. W. Ro, "Linebacker: Preserving victim cache lines in idle register files of GPUs," in ISCA, 2019.
[50] A. Pattnaik, X. Tang, O. Kayiran, A. Jog, A. Mishra, M. T. Kandemir, A. Sivasubramaniam, and C. R. Das, "Opportunistic computing in GPU architectures," in Proceedings of the 46th International Symposium on Computer Architecture, 2019.