AMOEBA: A Coarse Grained Reconfigurable Architecture for Dynamic GPU Scaling
Xianwei Cheng, Hui Zhao, Mahmut Kandemir, Beilei Jiang, Gayatri Mehta
Xianwei Cheng, Computer Science and Engineering Department, University of North Texas, [email protected]
Hui Zhao, Computer Science and Engineering Department, University of North Texas, [email protected]
Mahmut Kandemir, Computer Science and Engineering Department, Pennsylvania State University, [email protected]
Beilei Jiang, Computer Science and Engineering Department, University of North Texas, [email protected]
Gayatri Mehta, Electrical Engineering Department, University of North Texas, [email protected]
ABSTRACT
Different GPU applications exhibit varying scalability patterns with network-on-chip (NoC), coalescing, memory and control divergence, and L1 cache behavior. A GPU consists of several Streaming Multiprocessors (SMs) that collectively determine how shared resources are partitioned and accessed. Recent years have seen divergent paths in SM scaling towards scale-up (fewer, larger SMs) vs. scale-out (more, smaller SMs). However, neither scaling up nor scaling out can meet the scalability requirements of all applications running on a given GPU system, which inevitably results in performance degradation and resource under-utilization for some applications. In this work, we investigate the major design parameters that influence GPU scaling. We then propose AMOEBA, a solution to GPU scaling through reconfigurable SM cores. AMOEBA monitors and predicts application scalability at run-time and adjusts the SM configuration to meet program requirements. AMOEBA also enables dynamic creation of heterogeneous SMs through independent fusing or splitting. AMOEBA is a microarchitecture-based solution and requires no additional programming effort or custom compiler support. Our experimental evaluations with application programs from various benchmark suites indicate that AMOEBA is able to achieve a maximum performance gain of 4.3x, and generates an average performance improvement of 47% across all benchmarks tested.
GPUs have emerged as performance accelerators for general-purpose applications and take advantage of the single-instruction multiple-threads (SIMT) programming model to improve the performance of data-parallel computations. Supercomputers [1, 2], cloud servers [3], desktops [4] and even mobile devices [5] already benefit significantly from GPUs to achieve better performance and higher power efficiency. A GPU typically consists of many compute units (CUs), also called streaming multiprocessors (SMs), and each SM contains a large number of simple compute cores [6]. GPUs leverage the massive number of computing cores in SMs to exploit thread-level parallelism (TLP) in an attempt to hide memory access latency [7, 8].

The multiprocessor industry has been fueled by Moore's Law for many years, and processor performance has been improved through increasing transistor counts. However, Moore's Law is slowing down because we are reaching the technological limits of how small transistors can be made. Increasing the chip size can allow us to add more transistors to a chip. However, this is not a sustainable solution for several reasons: (1) there is not enough power budget to allow all transistors to be powered on simultaneously, because transistor threshold voltage does not scale with technology nodes and per-transistor switching energy stays almost constant [9]; (2) cost becomes prohibitive in manufacturing chips with ultra-high transistor counts [10]; and (3) data communication becomes a bottleneck as chip size increases [11].

TLP has been considered a promising solution to tackle the slowdown of Moore's Law, and GPU architecture is based on the idea of exploiting TLP. The computing power of a GPU arises from its SIMT architecture: many threads are executed concurrently in an SIMD fashion. However, an average programmer may not be aware of the details of the underlying hardware, and thus may not write high-quality code that fully utilizes the available GPU resources. As a result, many general-purpose applications are not fully optimized for running on a specific GPU architecture, and this causes under-utilization of hardware resources [12, 13, 14, 15, 16, 17, 18]. For example, it has been observed that cores are idle for 52% to 98% of the execution time for some GPU benchmarks [12]. Therefore, instead of adding more resources to GPUs, exploring optimized resource utilization techniques can be a more viable option to enhance GPU performance and power efficiency.

There have been earlier efforts targeting maximized GPU resource utilization. For example, several prior works have proposed to share a GPU among multiple applications and used software-level techniques to manage resource sharing [19, 20, 21]. On the hardware side, spatial multitasking has been proposed as a technique that partitions a GPU among multiple kernels at an SM granularity [22]. Several techniques have also been proposed to share resources among kernels inside each SM, such as simultaneous multikernel (SMK) [12], warp-slicer [23], and GPU Maestro [16]. However, these techniques need to run multi-programmed workloads to fully utilize GPU resources (i.e., finding more tasks to avoid GPU inactive cycles). Also, they do not consider an application's scalability and do not configure hardware to meet the software's resource demands.

An alternative approach is to dynamically reconfigure hardware to avoid resource under-utilization. Reconfigurable cores have been proposed in the past for CPUs to facilitate parallel execution [24, 9, 25, 26, 27].
However, the overhead of reconfiguring CPU cores is high due to the complexities associated with CPU architectures. There have been very few reconfigurable GPU architectures proposed. For example, R-GPU interconnects the compute cores inside an SM to reduce data movement and remove decoding overhead by assigning each core a fixed operation [14]. In comparison, SGMF is a dataflow architecture using a coarse-grain reconfigurable fabric [13]. However, SGMF needs help from the compiler to convert the kernels into dataflow graphs. Neither work considers the scalability of applications with respect to system bottlenecks, such as the interconnection network, memory access patterns, and control divergence. In addition, these prior works only explore intra-SM resource utilization and assume that the number and size of SMs are fixed. However, sharing resources among SMs is also important because applications have varying scalability patterns depending on SM settings, yet exploration of this design space has largely been ignored in prior work.

In this work, we present AMOEBA, a reconfigurable architecture to improve GPU resource utilization, performance, and energy efficiency. AMOEBA takes into account several important application resource requirements, such as interconnect throughput, memory access patterns, and control divergence, before selecting an optimal GPU scaling option. AMOEBA is a coarse-grained reconfigurable architecture that enables flexible SM scaling at a low cost. It also exploits heterogeneity of SMs through dynamic fusing and splitting in order to accommodate program divergence.

We make the following contributions in this paper:

• We investigate the GPU scaling problem under a resource bound. We identify the important factors determining whether an SM should be designed in a scale-up or scale-out fashion. Building upon the results from this investigation, we propose a coarse-grained reconfigurable architecture that fuses the baseline scale-out SMs into larger scale-up SMs. This design enables optimized resource utilization across SM boundaries.

• We design an online controller that takes into account an application's dynamic behavior and makes reconfiguration decisions accordingly. The controller employs a binary logistic regression model to predict application scalability at low cost.
• We provide the design details of the proposed reconfigurable architecture. Our proposed architecture enables coarse-grained SM fusion, and it can provide support for both scale-up and scale-out GPU configurations.

• We propose a scheme to split individual scale-up SM cores dynamically when program divergence causes pipeline stalls.

To the best of our knowledge, this is the first paper to propose a reconfigurable GPU architecture that can dynamically toggle between scale-up and scale-out options, with the goal of maximizing resource utilization.
The GPU execution model divides the total work space into a grid and assigns a work item, also called a thread, to work on each portion of the data. Each thread executes the same set of instructions, which enables parallel multi-threaded execution in an SIMT fashion. Each segment of code loaded to a GPU is called a kernel. A group of threads that execute a kernel concurrently is referred to as a workgroup, also called a thread-block or a cooperative thread-array (CTA). The total work space is divided into blocks or CTAs, and threads within a given CTA can communicate with each other.

Figure 1: GPU architecture overview.

A high-level view of a GPU architecture is shown in Figure 1. A GPU consists of multiple compute units (CUs) or streaming multiprocessors (SMs), which are analogous to CPU cores. Each SM contains fetch, decode, execution, and memory access logic, and these units collectively form a pipeline. Several compute cores reside within each SM, and each compute core is a large, heavily-pipelined execution unit capable of executing both integer and floating-point operations. When a kernel is launched, each CTA is dispatched to an SM and executes there until its completion. A CTA is further divided into units called warps, also known as wavefronts. Typically, there are a large number of warps in flight inside an SM so that memory access latency can be masked by concurrent execution. There is a unified L2 cache coupled with the memory controllers, while the global memory is off-chip. On-chip data communication is implemented through an on-chip network.
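To make the grid/CTA/warp hierarchy concrete, here is a minimal CUDA sketch of a kernel launch; the kernel name, data sizes, and CTA dimensions are our illustration, not values taken from the paper.

```cuda
// Each thread processes one element; threads are grouped into CTAs
// (thread-blocks), and the hardware further groups every 32 threads
// of a CTA into a warp that executes in lockstep (SIMT).
__global__ void scale(float *data, float factor, int n) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;  // global thread id
    if (tid < n)
        data[tid] *= factor;
}

int main() {
    const int n = 1 << 20;
    float *d_data;
    cudaMalloc(&d_data, n * sizeof(float));

    // 256 threads per CTA -> 8 warps per CTA; the runtime dispatches
    // each CTA to an SM, where it runs until completion.
    dim3 cta(256);
    dim3 grid((n + cta.x - 1) / cta.x);
    scale<<<grid, cta>>>(d_data, 2.0f, n);
    cudaDeviceSynchronize();

    cudaFree(d_data);
    return 0;
}
```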
Figure 2: SM scaling trends (number of SMs vs. number of cores per SM) for NVIDIA's GTX GPU family [28].

The primary execution unit in a GPU is the compute core, and cores are grouped into SMs that share resources such as the register file, local memory, and warp scheduler. Due to resource limitations, once the chip size is decided, the total number of compute cores is fixed. There then arises an important scaling problem: how should compute cores and other resources be partitioned into SMs? That is, should we opt for scale-up SMs (by including more compute cores and resources in a smaller number of SMs) or scale-out SMs (by having more SMs with fewer cores and less resource inside)? The scaling configuration of SMs is critical since it directly determines the maximum parallelism among GPU threads and impacts resource sharing and utilization.

Figure 2 shows the scaling trends of NVIDIA's GTX GPU family during the last 11 years. The number of computing cores per SM can be used to represent the scaling in SM size. We observe that both the size and the number of SMs increased from 2008 to 2011. However, after 2011, the trends in SM size and SM count start to part ways in opposite directions. This is because we are reaching the limit in terms of the total number of computing cores that can be integrated into a chip due to power and area constraints. Therefore, it is not possible to scale up both the size and the number of SMs; we either scale out or scale up, but not both, as shown in the figure. The most recent trend, since 2017, has been scaling out. However, the question is whether this trend of scaling out is sustainable for the future, and if not, what is the optimal configuration for the best performance and resource utilization?
As discussed above, warps execute in SMs, and all threads in one SM share GPU resources such as shared memory, L1 cache, register file, warp scheduler, and interconnect interface. SM scaling greatly influences resource utilization and power-performance efficiency. Due to their different characteristics and resource requirements, different applications exhibit varying SM scaling patterns. We start by investigating the scaling of multiple benchmarks; the results are plotted in Figure 3(a).

Figure 3: Performance with SM scaling (a) with a mesh NoC, (b) with a perfect NoC. (The x-axis is the number of SMs and the y-axis is IPC normalized to 16 SMs.)

Figure 4: Memory access coalescing results with different GPU scaling options. Actual memory access rate represents the memory accesses after coalescing. Here, we experiment with different SM scaling options with 16, 25, 36, and 64 SMs.

In this experiment, we fix the total amount of chip resources but vary the size and the number of SMs. As can be observed, some applications benefit from scaling out with smaller SMs (CP, SC), while other applications benefit from scale-up SMs (MUM, RAY). This result indicates that there is no single scaling setting that benefits all applications. Motivated by this observation, we next examine in detail the major factors that determine an application's performance with SM scaling.

(1) NoC Effect on SM Scaling. GPU SMs are connected to the L2 cache and memory controllers through a network-on-chip (NoC). It has been shown that the NoC is a bottleneck in GPU performance as the chip size grows [11]. This is due to the particular traffic pattern exhibited by GPUs. Specifically, all SMs communicate with the limited number of memory controllers on the chip. As the total on-chip bandwidth is fixed and shared by all SMs, more SMs means that each SM receives a smaller share of the network bandwidth. In addition, a larger network incurs longer delays due to increased hop count and contention. As a result, there is more negative impact on performance. We experimented with different SM scaling options using a perfect NoC (with zero delay), and the results are plotted in Figure 3(b). We can observe that when the NoC impact is removed, more applications (e.g., LPS, AES, CP, and SC) achieve better performance with scale-out settings. This means that, for applications that are sensitive to the on-chip network performance, performance will ultimately degrade when we keep scaling out the SMs.

(2) Memory Locality and SM Scaling. It has been observed that memory resources inside an SM affect the performance of some applications. Some applications may share data extensively among warps in one SM or among L1 caches in different SMs. In such cases, scaling up will improve the utilization of shared memory and L1 cache, and thus reduce accesses to memory outside of an SM. GPUs employ a mechanism called memory coalescing to reduce data movement. The idea is to combine multiple memory accesses from a warp to the same cache line into a single transaction. Larger SMs can execute larger warps, and thus provide more opportunities for memory coalescing. We quantitatively characterize the coalescing effects in SMs with different scaling settings, as shown in Figure 4. In this figure, the y-axis shows the actual memory accesses as a percentage of all load and store instructions after coalescing. As shown in Figure 4, a scale-up design with 16 SMs generates far fewer memory accesses than a scale-out design with 64 SMs. That is, as far as coalescing is concerned, scale-up SMs bring more benefits than scale-out SMs.

In addition, recent GPU architectures combine data cache and shared memory functionality into a single memory block to provide the best overall performance. This makes the actual L1 cache capacity several times larger when needed. For example, NVIDIA's Volta [29] architecture has a combined capacity of 128 KB/SM, more than seven times larger than the GP100 data cache, and all of it is usable as a cache by programs that do not use shared memory. Considering this trend, we also investigated L1 data sharing among neighboring SMs with increased L1 capacity, and the results are plotted in Figure 5. As can be observed, some benchmarks (e.g., HW) exhibit around a 10% sharing rate in the baseline configuration. When the L1 capacity is increased by two or four times, a higher sharing rate is observed in most benchmarks that exhibit data sharing. This means that scaling up SMs by increasing the L1 capacity can effectively reduce duplicated data, leading to more efficient utilization of the L1 caches.

Figure 5: Rate of shared data in L1 caches of neighboring SMs.

Figure 6: Control divergence caused stalls with different GPU scaling options.

(3) Control Divergence and SM Scaling. Recent GPU architectures allow individual threads to follow distinct program paths with control flow on the SIMD pipeline. Control divergence occurs when threads in the same warp take different paths upon a conditional branch, which can lead to significant performance degradation because it increases pipeline stalls [30, 31]. Even though various software techniques have been proposed to better schedule branch instructions [19, 20, 21], control divergence cannot be totally removed. We measured the core inactivity caused by control divergence, as shown in Figure 6. It can be seen from this plot that, for scale-up SMs, pipeline stalls caused by branch instructions are much more frequent than for scale-out SMs. In fact, for many benchmarks, the cores are stalled for more than half of the time waiting for branch instructions to be resolved. This is because, in larger SMs, the pipeline is wider than in smaller SMs; as a result, a pipeline stall causes a larger reduction in computation parallelism. In this sense, applications with many control instructions need to employ scale-out SMs for better performance.
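To make the hazard concrete, the hypothetical CUDA kernel below forces threads of the same warp onto different paths; the kernel and its data layout are our illustration, not a benchmark from the paper.

```cuda
// Hypothetical kernel illustrating intra-warp control divergence.
// Even tids take the 'if' path and odd tids take the 'else' path, so
// every 32-thread warp splits: the SIMD pipeline executes each side
// with half of its lanes masked off. On a wider (scaled-up) SM, each
// such stall idles twice as many lanes.
__global__ void divergent(float *out, const float *in, int n) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid >= n) return;
    if (tid % 2 == 0)
        out[tid] = in[tid] * in[tid];   // even lanes active here
    else
        out[tid] = in[tid] + 1.0f;      // odd lanes active here
}
```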
Ideally, threads running on GPUs execute in a lock-step fashion and consume continuous computation enabled by warp scheduling to avoid pipeline stalls. However, it has been shown that, for some applications, control-flow divergence and memory divergence inside warps can significantly degrade performance by causing stalls in SM pipelines [30, 31, 32]. Memory divergence occurs when threads from a single warp experience different memory-reference latencies caused by cache misses or by accessing different DRAM banks. In current organizations, the entire warp must wait until the last thread has its reference satisfied. To solve this problem, several techniques have been proposed to divide a warp into smaller slices and regroup them to create a new warp, so that divergent threads do not prevent other threads from proceeding in execution [23, 16, 33]. However, to our knowledge, all existing work subdivides a warp and reorganizes the threads to build a new warp that runs on the "same-sized" SM.

There is a significant drawback in the above-mentioned techniques when implementing variable warp sizes: the SM needs to be subdivided to support the execution of a gang of split warps. For example, gang-warp [16] needs to divide an SM into four slices, and each slice works, after splitting, as a small SM. There are prohibitive hardware overhead and design complexity issues in this type of approach. In addition, prior work only considers resource utilization inside an SM, not across SMs. In contrast, we consider sharing among SMs at a larger granularity. We also consider resources such as the NoC, L1 sharing, and coalescing among SMs, which have not been explored by prior work.

In our proposed approach, we first observe the application's scalability with SM resources such as network and memory. If we detect that the application works better with scale-up cores, we fuse two small SMs into one big SM. However, such a scheme fuses all SMs statically and may not flexibly adapt to a program's dynamic divergence. For example, control and memory divergence between the threads inside a warp may cause long stalls in the fused SM, since the pipeline is much wider. Based on this observation, we propose to dynamically split the scaled-up SM into two smaller SMs to handle the control and memory divergence within a warp. Once we detect that the divergence no longer exists, we fuse the two SMs back into one.

Since a fused SM already consists of two sets of execution paths, no extra hardware is needed to support slicing, as opposed to the prior work [23]. It needs to be noted that we dynamically split and fuse SMs independently in this scheme. Fusing and splitting decisions are made based on the current warps' running status, locally on each SM. As a result, when using our approach, at any given time during execution, the GPU architecture can have two types of SMs: some (fused) big SMs and some (split) small SMs. Through this type of dynamic heterogeneity, we are able to further improve resource utilization and achieve better performance over the state of the art.
Usually, reconfigurable architectures involve redesigning micro-architecture units, and this may lead to significant overhead if not handled carefully. For this reason, not many reconfigurable CPU architectures have been proposed in the past. In the case of GPUs, however, the reconfiguration overhead can be much lower. This is because GPU SMs have a much simpler structure and control logic compared to general out-of-order CPU cores. Specifically, a GPU has a very simple in-order pipeline, which reduces the reconfiguration complexity. In addition, GPUs are designed to hide memory latency by overlapping the execution of a large number of threads. As a result, delays caused by reconfiguration can be conveniently masked. This makes GPUs excellent candidates for reconfigurable architectures. Reconfiguration overhead also heavily depends on the granularity at which reconfiguration takes place. In this work, we propose a coarse-grained reconfigurable architecture based on SMs, which further reduces design complexity and overhead. Specifically, we only reconfigure GPUs at the SM level, without modifying pipeline structures; we only modify a few managed resources such as warp queues, L1 cache, and register files. Therefore, the proposed GPU architecture is very amenable to reconfiguration.
The goal of our design is to reduce resource under-utilization and also improve performance. To reduce design complexity, we opt for coarse-grain reconfiguration. Since it has been shown that individual kernels exhibit regular behavior, we propose a one-time reconfiguration scheme on a kernel-by-kernel basis. Once a kernel is determined to benefit from scale-up SMs, we fuse every two neighboring SMs to create scale-up SMs. Otherwise, we continue executing the kernel using scale-out SMs. Our method is basically a top-down approach: we first characterize the kernel's overall scaling behavior with respect to overall GPU resource utilization, and then make a decision regarding whether to fuse or not. Based on this static fusion scheme, we also propose to refine the mechanism by allowing individual fused SMs to split dynamically if warps exhibit significant divergence in the fused SM.

Figure 7: Reconfiguration controller overview.
A high-level view of our reconfiguration controller is shown in Figure 7. Profiling has been employed by many resource utilization techniques to determine an application's characteristics [16, 24, 9]. In this work, we propose to combine online profiling with an offline-trained model to predict scalability. When a new kernel starts, we first evaluate various metrics regarding its execution. Then, these metrics are fed into a scalability predictor that has been trained offline. The scalability predictor indicates whether the kernel should be executed on scale-up or scale-out SMs. Next, we reconfigure the SMs according to this result and start executing the kernel. After the kernel finishes, we start the loop again for the next kernel.
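A minimal sketch of this per-kernel control loop is given below; all structure and function names are ours, standing in for hardware mechanisms, and are not part of AMOEBA's actual implementation.

```cpp
#include <cstdio>

// Hypothetical host-side sketch of AMOEBA's per-kernel control loop.
struct Metrics {
    double noc_throughput, noc_latency, coalescing_rate,
           l1_miss_rate, mshr_rate, inactive_thread_rate;
};

// Placeholder stubs standing in for the hardware mechanisms.
Metrics profile_one_cta(int kernel_id) { return Metrics{}; }
bool    predict_scale_up(const Metrics &m) { return false; }
void    configure_sms(bool fused) { printf("fused=%d\n", fused); }
void    run_kernel(int kernel_id) {}

void dispatch(const int *kernels, int num_kernels) {
    for (int k = 0; k < num_kernels; ++k) {
        Metrics m = profile_one_cta(kernels[k]); // 1. online profiling
        bool fuse = predict_scale_up(m);         // 2. scalability predictor
        configure_sms(fuse);                     // 3. one-time reconfiguration
        run_kernel(kernels[k]);                  // 4. execute kernel to completion
    }
}
```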
It has been shown that kernels exhibit disparate behavior with SM scalability and resource utilization [12, 34]. Therefore, we cannot profile one kernel to predict the behavior of an entire application. Recall, however, that each kernel is split into smaller blocks, called CTAs, that execute similar portions of the code. We found that CTAs exhibit very consistent behavior, which closely follows the scalability trend at the kernel granularity. Figure 8 shows how CTAs follow the same scaling trend as their kernel, using the applications LIB and RAY. As can be observed, both the kernel and the CTAs of RAY show a scale-up trend, whereas the LIB kernel and its CTAs exhibit a scale-out trend. Therefore, we propose to use a CTA to predict the scaling behavior of a kernel.

Figure 8: Kernel and CTA scalability consistency.
To profile an application's scalability with respect to the SM size and number, we need to identify metrics that can influence the scalability. The following are the major metrics we considered in this work:

(1) NoC throughput: This metric reflects the application's "communication intensity". If the NoC is a bottleneck, choosing scale-up cores will improve performance, because the SM count would be smaller and the network size accordingly smaller, resulting in each core having a higher network throughput.

(2) Average NoC latency: This is the average latency of the packets. It can also be used to evaluate communication intensity.

(3) Coalescing rate: The coalescing rate is calculated as the number of actual memory accesses sent out from each SM divided by the total number of memory accesses in the instructions. This metric reflects how much shared data is requested across warps in an SM.

(4) L1 cache miss rate: This reflects an application's demand for local memory. If the miss rate is high and the data is not streaming, allocating a larger L1 will improve the performance, which means scale-up SMs are expected to perform better.

(5) MSHR rate: This metric is similar to the coalescing rate, but it is measured across different instructions. Scale-up SMs will have more instructions in flight, and this will benefit applications with higher MSHR rates.

(6) Inactive thread rate: This is used to reflect warp control divergence. It is calculated as the number of cycles threads spend idling due to control instructions, divided by the total execution cycles. Kernels with larger control divergence favor scale-out SMs.
In this work, we propose to use binary logistic regression, a machine learning technique borrowed from the field of statistics, to predict scalability. Our model accepts several input parameters and generates a binary output indicating whether an application should be run with scale-up or scale-out SMs. Since we only fuse two neighboring SMs to build a scale-up core, we only need a simple regression-based model to predict scalability. The output of the model is binomial: yes or no to scale up.

Binary logistic regression estimates the probability that a characteristic is present (e.g., the probability of "success"), given the values of explanatory variables. Unlike the normal distribution, the mean and variance of the binomial distribution are not independent. Specifically, the mean is denoted by $P$ and the variance by $P(1-P)/n$, where $n$ is the number of observations, and $P$ is the probability of the event occurring (i.e., whether we need to reconfigure smaller SMs into bigger SMs) in any one trial. If we were considering the data in a list rather than in table form, we would assume that the variable has mean $P$ and variance $P(1-P)$, i.e., a Bernoulli distribution. When we have a proportion as a response, we use a logistic or logit transformation to link the dependent variable to the set of explanatory variables. The logit link has the form:

$$\mathrm{Logit}(P) = \log\left[\frac{P}{1-P}\right]. \quad (1)$$

The term within the square brackets is the odds of an event occurring; in our case, it indicates whether we need to configure bigger cores. The logit scale stretches a proportion to the range from minus to plus infinity, with $\mathrm{Logit}(P) = 0$ when $P = 0.5$. When we transform our results back from the logit (log-odds) scale to the original probability scale, the predicted values are always at least 0 and at most 1. If there is only one input $x$, we can write the model as:

$$P = \frac{e^{b_0 + b_1 x}}{1 + e^{b_0 + b_1 x}}, \quad (2)$$

where $b_0$ is the bias or intercept term and $b_1$ is the coefficient for the single input value $x$. We can write the model in terms of odds as:

$$\frac{P}{1-P} = e^{b_0 + b_1 x}. \quad (3)$$

Conversely, the probability of the outcome not occurring is

$$1 - P = \frac{1}{1 + e^{b_0 + b_1 x}}. \quad (4)$$

For an event with multiple input factors, the modeled logarithm of the odds is given by:

$$\log\left(\frac{P}{1-P}\right) = b_0 + b_1 x_1 + b_2 x_2 + \dots + b_n x_n, \quad (5)$$

where $P$ indicates the probability of the event (in our case, the chance to scale up by fusing SMs), and the $b_i$ are the regression coefficients associated with the explanatory variables $x_i$. We train this binary logistic model offline using a large amount of experimental data to obtain the values of $b_0$ through $b_n$. We then use this model to directly infer the fusing decision online. Since the model is in fact linear, its implementation overhead is quite low. We give more details of the overheads in later sections.
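Since inference only needs the sign of the log-odds, the online predictor reduces to a dot product followed by a threshold test; the sketch below is ours, with a placeholder feature count and coefficient layout.

```cpp
#include <cmath>

// Minimal sketch of inference with the trained logistic model: compute
// the log-odds b0 + sum(bi * xi); the probability of "fuse" is the
// logistic function of this value, so P > 0.5 exactly when the
// log-odds is positive. NUM_FEATURES and the layout are placeholders.
const int NUM_FEATURES = 6;

bool predict_scale_up(const double x[NUM_FEATURES],
                      const double b[NUM_FEATURES + 1]) {
    double log_odds = b[0];                       // intercept b0
    for (int i = 0; i < NUM_FEATURES; ++i)
        log_odds += b[i + 1] * x[i];              // bi * xi
    double p = 1.0 / (1.0 + std::exp(-log_odds)); // P(scale up), Eq. (2)
    return p > 0.5;                               // equivalently: log_odds > 0
}
```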
The goal of AMOEBA is to create a GPU architecture that can dynamically change the number and size of its SMs based on run-time workload behavior. We propose to start with a "baseline" scale-out machine and fuse neighboring SMs into a bigger SM if the application is found to perform better with scaling up. Note that we allow fusing only two neighboring SMs, due to the following considerations: (1) Our scale-out SM has 32 SIMD units, and a scaled-up SM has 64 SIMD units when two SMs are fused. Fusing more SMs would significantly increase the pipeline width and the probability of pipeline stalls. In the future, if the scale-out SM gets even smaller, for example with 16 SIMD units, then fusing 4 such SMs together would become a more viable option; note that our techniques can easily be extended to fuse more SMs to scale up. (2) Because fused SMs share resources such as the L1 cache, register files, and warp schedulers, fusing more SMs means increased communication latency and implementation complexity. For example, a larger L1 cache needs a longer access time, which would compromise the potential benefit of SM fusion. For these reasons, we only consider fusing two neighboring SMs in this paper.

Figure 9: SM reconfiguration via fusion.
Figure 9 shows how two scale-out SMs are fused to create a scale-up SM. The dashed lines show the fused units of the two SMs, placed to ensure that they can work in a lockstep fashion as one SM. The shaded components in SM1 are disabled due to SM fusion. In the fused SM, instructions are first fetched from the fused L1 I-cache. Then, the instructions are decoded, and selected instructions are sent to the per-warp I-buffers. Next, the control logic decides which instruction to issue, and the decision is sent to the issue unit. Selected warps are then sent to the datapaths of both SM0 and SM1 for execution. Memory accesses are sent from the executing threads to the fused memory unit.

In Figure 9, there are two baseline SMs, shown as SM0 and SM1. AMOEBA does not change the execution units such as the SPs or SFUs. When fused, the register files and scoreboards of the two original SMs work independently, as in the baseline. AMOEBA does not change the register files: since register files are allocated with warps, they are not fused but can be accessed independently. Thus, there is no change in the throughput of any individual register file. Similarly, the scoreboard connection with each register file is not modified either. However, the connection of the scoreboard in SM1 to its warp scheduler is removed when the two SMs are fused; instead, this scoreboard is connected to the warp scheduler of SM0. This is because when we fuse two SMs, only one warp scheduler is kept, and it schedules all warps on both SMs.

The memory components of the two SMs need to be fused, and this includes the shared memories, L1 I-caches, L1 D-caches, and L1 constant caches. We fuse L1 caches by increasing the cache associativity. To reduce the new L1 cache access latency, the SM layout is modified as shown in Figure 9, so that the L1 caches of both SMs are placed next to each other. Since GPUs are good at hiding memory access latencies through overlapped warp execution, the extra delay caused by accessing a larger L1 D-cache can be hidden by warp computation. In our experiments, we conservatively added one extra cycle to the L1 cache access latency due to the cache fusion. Our results show that this extra delay is hidden quite well by the overlapped computation.

Each fused SM has one copy of the coalescing unit, created by fusing the two coalescing units of both SMs. Since the warp size is doubled, this leads to more opportunities for coalesced memory accesses. After fusing two SMs, AMOEBA combines the NoC routers of the two SMs into one by disabling one SM's router; this is implemented by adding a bypass path in the disabled router. As a result, the network size is reduced, which significantly reduces the network latency, and consequently each router can enjoy a higher throughput in the network.
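As a rough illustration of fusing L1 caches by increasing associativity, the sketch below doubles the way count while keeping the set count, so the address-to-set mapping is unchanged; the 4-way and 128-byte-line parameters are our assumptions, not values from the paper.

```cpp
// Illustrative L1-fusion arithmetic: two 16 KB caches keep their set
// count but contribute their ways to one 32 KB cache of doubled
// associativity, so set indexing is unchanged after fusion.
struct CacheCfg {
    int size_bytes, ways, line_bytes;
    int sets() const { return size_bytes / (ways * line_bytes); }
};

CacheCfg fuse(const CacheCfg &a) {
    // Double capacity and associativity; sets() stays the same.
    return { a.size_bytes * 2, a.ways * 2, a.line_bytes };
}

// Example: CacheCfg l1{16 * 1024, 4, 128};  // 32 sets (assumed params)
//          CacheCfg big = fuse(l1);         // 32 KB, 8-way, still 32 sets
```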
We propose to fuse SMs to reconfigure the GPU as a scale-up architecture when we observe that fusing the resources of two SMs is beneficial from a performance angle. It needs to be noted that our approach is different from prior works such as variable warps [30] or warp subdivision [33]. Those works only consider the resources inside an SM and try to fully utilize them; there is no cross-SM resource utilization optimization in those prior studies. Our proposed architecture, on the other hand, takes into account cross-SM resource utilization, such as NoC resources, sharing L1 caches between SMs, and memory access coalescing across SMs. As a result, it is fundamentally different from the earlier works.

However, there are still opportunities to further improve resource utilization in AMOEBA. That is, when we fuse two SMs, there can be scenarios where warp heterogeneity causes inefficient pipeline utilization. For example, even though fusing two SMs can bring benefits in cache access or NoC performance, the resulting larger warp size creates wider pipelines. In this case, divergence in the memory or control behavior of warps can lead to more pipeline stalls compared to the unfused SMs. Therefore, we propose a dynamic SM splitting strategy: when we observe significant warp divergence, and the wide pipeline leads to a performance degradation that outweighs the benefits of fusion, we split the fused SM into two separate SMs. In this way, each split SM has half the pipeline width, and the warps that cause divergence can only cause stalls in one of the smaller SMs. The other SM can keep computing without being delayed by the pipeline stalls.

We can have different policies to decide when to split a given "fused" SM into two independent ones. Note that, by "independent", we mean that the two SMs run different warps independently on their respective datapaths. However, to reduce the cost of hardware and context switching, we do not split the shared resources, such as the L1 cache, register files, and NoC interface. We set up a threshold to decide when to split, defined as a fixed ratio of divergent warps to the total warps running in the large SM. If the current ratio is greater than the threshold, we split the SM into two. This figure also shows how NoC interfaces are bypassed when two SMs are fused together.

After the SM splits, we move all divergent warps from the bin to the new SM created by the split. Subsequently, the two SMs start independent execution of their warps. When the second SM finishes all divergent warps, we re-fuse the two SMs into one. Then, we start the procedure of collecting divergent warps again, and split the SMs when necessary. Thus, this procedure of splitting and fusing is decided dynamically by the divergence of warps. This mechanism is expected to maximize resource utilization and reduce stalls in the fused SMs.

The idea behind splitting is to prevent divergent warps from causing pipeline stalls. So, we need to separate divergent warps and non-divergent warps into two clusters and run each cluster on a separate smaller SM, so that the slow warps do not delay the fast warps. Suppose that we have split a scale-up SM into two scale-out SMs (SM 0 and SM 1), and we want to run fast warps on SM 0 and slow warps on SM 1. There can be different mechanisms to decide which warps to move to SM 1. In this work, we investigated two methods: (1) direct split and (2) warp regrouping.
The direct split method is simple: it directly divides a divergent warp in the middle into two smaller warps, and both smaller warps are moved to SM 1. This method has a low cost but may not yield optimal performance, because the slow threads in a divergent warp may be located at different positions. If we simply cut the warp in half, there can be varying combinations of resulting warps: we may get one warp with all fast threads and one warp with all slow threads, or we may get two smaller warps that each contain some slow threads. The ideal case is the former, since it better isolates the negative effects of the slow threads from the fast ones.

Based on this analysis, we propose a second method that regroups threads into a fast warp and a slow warp. We then move the slow warp to SM 1 and keep the fast warp in SM 0. To accomplish this, we first divide the threads in the original warp into small groups, and label them as "fast" or "slow" based on divergence. Then, we regroup them into two warps so that the slowest groups are all put into a slow warp and moved to SM 1. In our design, we also periodically check the stalls in the slow SMs, and periodically move some fast warps to them, so that resources are not wasted when the slow warps cause stalls.

The hardware overhead of splitting is low because the split SMs were two independent SMs in the baseline architecture anyway. We added hardware to fuse them as described earlier, and splitting them needs no extra hardware, except for the management and storage of the divergent warps. Therefore, we only need a new warp queue and some simple control logic. Compared to the prior works [30, 33] that proposed splitting resources inside one SM, our overhead is very low. Figure 10 and Figure 11 show the timing and the algorithm of our dynamic splitting and fusing.

Figure 10: Mechanism for switching between fusing and splitting.

Figure 11: Algorithm to dynamically split a fused SM to accommodate warp heterogeneity.
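The split policy and the warp-regrouping step can be summarized as in the hypothetical sketch below; the threshold value, the Warp/Thread types, and the helper names are ours, not AMOEBA's actual implementation.

```cpp
#include <vector>
#include <algorithm>

// Hypothetical sketch of the split decision and of warp regrouping.
const double SPLIT_THRESHOLD = 0.25;  // divergent-to-total warp ratio (assumed value)

struct Thread { int id; bool slow; };       // 'slow' = divergent/long-latency
struct Warp   { std::vector<Thread> threads; bool divergent; };

// Split the fused SM once the fraction of divergent warps exceeds
// the fixed threshold described in the text.
bool should_split(const std::vector<Warp> &warps) {
    int divergent = 0;
    for (const Warp &w : warps) divergent += w.divergent;
    return (double)divergent / warps.size() > SPLIT_THRESHOLD;
}

// Warp regrouping: reorder the threads of a divergent warp so the
// slow threads gather in one half; the slow half becomes a new warp
// moved to the split-off SM 1, while the fast half stays on SM 0.
void regroup(Warp &w, Warp &fast_out, Warp &slow_out) {
    std::stable_sort(w.threads.begin(), w.threads.end(),
                     [](const Thread &a, const Thread &b) {
                         return !a.slow && b.slow;   // fast threads first
                     });
    size_t half = w.threads.size() / 2;               // halving is a simplification
    fast_out.threads.assign(w.threads.begin(), w.threads.begin() + half);
    slow_out.threads.assign(w.threads.begin() + half, w.threads.end());
}
```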
Table 1: System configuration. See GPGPU-Sim v3.2.2 [35] for the full list.
Number of Computing Cores: 48
Number of Memory Controllers: 8
MSHRs per Core: 64
Warp Size: 32
SIMD Pipeline Width: 8
Number of Threads per Core: 1024
Number of CTAs per Core: 8
Constant Cache Size per Core: 8 KB
Texture Cache Size per Core: 8 KB
L1 Cache Size per Core: 16 KB
L2 Cache Size per Core: 128 KB
Number of Registers per Core: 16384
Warp Scheduler: Greedy-Then-Oldest
Shared Memory: 48 KB
Memory Scheduler: FR-FCFS
Memory Model: 8 MCs, 924 MHz
NoC Channel Width: 128 bits
NoC Topology: mesh
NoC Router Pipeline Stages: 2

Figure 13: Control divergence caused stalls.
We simulate our baseline architecture using a cycle-level simulator (GPGPU-Sim [36]) and faithfully model all key parameters (Table 1). The baseline GPU consists of 48 scale-out SMs with a warp size of 32. There are 8 memory controllers on the chip. The interconnection network is a mesh-based NoC with two subnets to avoid deadlock between request and reply messages. The router has a 2-stage pipeline. When we perform reconfiguration, two baseline SMs are fused to create one scale-up SM. We use a wide range of GPU applications from ISPASS [37], Rodinia [38], Polybench [39], and Mars [40] to evaluate our design, and we execute all applications to completion. We report performance results using the geometric mean of IPC speedup (over the baseline GPU). We also report other evaluation metrics provided by the simulator, such as L1 cache miss rate, NoC latency, network injection rate, and SM idle rate.
Figure 12: Performance results.

Figure 12 illustrates the performance gains obtained with AMOEBA. The baseline is a scale-out architecture, and we also experiment with direct scale-up. We present the performance of the three techniques proposed by AMOEBA. Static fuse configures the SMs only once, before a kernel's execution: using the prediction model, AMOEBA predicts the scalability of the application with SMs and chooses whether or not to fuse pairs of SMs. The next two techniques are based on dynamic heterogeneous SM scaling. Direct split simply divides a divergent warp in the middle into smaller ones, whereas warp regrouping employs the more sophisticated technique of reorganizing threads into a fast warp and a slow warp. As can be observed, SM achieves the highest improvement in performance, by 4.25 times. MUM also achieves a significant performance improvement of 2.11 times. On average, the 12 benchmarks see around a 47% increase in IPC.

For applications that can benefit from larger SMs, static fuse achieves almost the same performance gain as direct scale-up. However, some benchmarks prefer scale-out configurations, such as ATAX. For these workloads, our fusing techniques all perform better than direct scale-up (by about 10%). This shows that AMOEBA can accurately predict an application's scalability, and that the correct reconfiguration leads to performance gains. Some workloads, such as FWT and KM, are not sensitive to scaling, and all AMOEBA techniques perform similarly to the baseline. In general, direct split and static fuse bring similar benefits (on average) for most workloads, except BFS and SM. Some workloads, such as WP, even experience performance degradation, which is mainly due to the fusion overhead: the static technique cannot dynamically react to changes in workload behavior. In contrast, warp regrouping achieves a 16% performance gain over direct split because it can accurately capture a workload's dynamic, divergence-induced behavior.
Figure 13 plots the SM inactive rate caused by control divergence, which is defined as the fraction of cycles that SMs are stalled due to control instructions. We can observe that only some of the workloads suffer from stalls caused by control divergence. For the workloads that do have control-divergence-caused stalls, dynamic fusion performs better than direct scale-up and static fusing because it can dynamically adjust to changes in control divergence. Warp regrouping performs better than direct split in more cases because fast and slow warps are allocated to different SMs. Among all configurations, the baseline scale-out configuration has the least amount of stalls because its pipeline width is always smaller than that of the other configurations.
Figure 14: L1-I cache miss rate.

Figure 15: L1-D cache miss rate.

The L1-I cache miss rate is plotted in Figure 14. Some benchmarks, such as FWT and ATAX, are not sensitive to the L1-I cache capacity, and fusing does not change their behavior. However, most benchmarks have their miss rates reduced; the average reduction is 9%, 20%, and 30% for the three AMOEBA schemes, respectively. Sharing the L1-I cache through SM fusion reduces I-cache misses and thus leads to improved performance. Figure 15 plots the miss rate of the L1-D cache. The most significant reduction is for SM, whose miss rate is reduced by more than 70%. This is because the sharing of the L1 cache increases its effective capacity, and this change directly leads to the 4.25x improvement in performance. Some benchmarks, such as BFS and MUM, experience increased L1-D cache miss rates. This is because warp regrouping changes data locality by moving warps between SMs, and this leads to higher miss rates.

The impact of AMOEBA on memory accesses is plotted in Figure 16. As can be observed, all benchmarks achieve reduced actual memory access rates compared to the baseline. The actual memory access rate is calculated as the actual memory access count divided by the total number of memory accesses in the instructions. Since AMOEBA allows SMs to share coalescing units, the actual number of loads and stores is greatly reduced.

Figure 16: Actual memory accesses.

Figure 17: Normalized rate of stalls when MCs cannot inject into the NoC.
Figure 17 plots the normalized ICNT stall rate, which is defined as the rate of stalls when new reply packets cannot be generated because an MC's injection queues are full. This data reflects the pressure on both the NoC and the memory controllers. As can be observed from this figure, all AMOEBA schemes are able to reduce this stall rate. For some benchmarks, such as CORR and COVR, this stall time is removed entirely. Since AMOEBA can fuse SMs and bypass some routers, the network size is reduced, and this leads to smaller hop counts. As a result, the NoC bottleneck can be greatly relieved for communication-intensive applications. Figure 18 shows the average network data injection rates for the SM configurations evaluated. As can be observed from this plot, all benchmarks have a higher injection rate under AMOEBA than under the baseline. This is because we fuse SMs and use only one NoC network interface to inject packets. Even though the injection rate is higher under the AMOEBA schemes, the network size is reduced by half, and this leads to shorter communication delays, paving the way to better performance.

Figure 18: NoC injection rate.
Figure 19: Phases of dynamic SM fusion and splitting.

To observe the dynamics of switching between fusing and splitting, we studied the status of five SMs in the benchmark RAY. The results are shown in Figure 19. As shown in this figure, all 5 SMs start with fused execution because this benchmark favors scale-up SMs. After a period of time, the SMs start to split because enough divergent warps have been detected and smaller SMs bring more benefit. However, the switching between fusing and splitting of each SM is independent of the others. As a result, at any given time, both scale-up and scale-out SMs can exist in the architecture, and better performance is achieved from this flexible heterogeneity in SM configurations provided by AMOEBA.
We use several performance counters to generate the detailed metrics required by our scalability prediction model. Most of these performance counters are already included in many of today's GPU systems, including cache hit and miss counters, MSHR statistics, and branch instruction statistics. For metrics that cannot currently be provided by performance counters, such as the number of concurrent CTAs, we propose to add the corresponding counters. Table 2 shows the coefficients in our scalability prediction model.

Table 2: Coefficients in the scalability prediction model.

Constant: -73.635
Concurrent CTA: 1.414
Control Divergent: 444.628
Coalescing: 2057.050
L1D Miss Rate: -313.838
L1I Miss Rate: 1674.513
L1C Miss Rate: -67.277
MSHR: -102.971
Load Inst Rate: -680.786
Store Inst Rate: -804.7
NoC: -8.301

To analyze the relative contribution of each metric in the prediction model, we plot the distributed weights of the major metrics. Here, we consolidate the different types of L1 cache miss rate into one metric called L1 miss rate. The result is shown in Figure 20. For each metric, the magnitude of impact is shown as a value between -1 and 1. The impact magnitude of a metric is calculated as the coefficient of the metric multiplied by its measured value; for example, the impact magnitude of the load instruction metric is calculated as the load instruction rate times its coefficient. All positive impact magnitudes contribute to a scale-up decision, and all negative impact magnitudes contribute to a scale-out decision. Eventually, we add all the metrics' impact magnitudes together and check the sum: if the result is positive, we predict that fusing SMs into a scale-up configuration will fit the application; otherwise, we predict that a scale-out configuration will fit better. In Figure 20, the sums of the impact magnitudes for BFS and RAY are both positive, so these benchmarks favor running on scale-up SMs. On the contrary, CP and PR prefer to run on scale-out SMs. It can also be observed that different applications' scalability is influenced by different metrics to varying extents. For example, MSHR plays a more significant role for BFS and CP, whereas PR and RAY are more sensitive to the NoC performance than the others.
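To make the decision arithmetic concrete, the sketch below plugs a vector of measured metric values into the Table 2 coefficients and sums the per-metric impacts; the coefficients are from the table, but the measured values here are hypothetical.

```cpp
#include <cstdio>

// Worked example of the impact-magnitude computation. Coefficients are
// the trained values from Table 2; the 'measured' vector is hypothetical.
int main() {
    const char *name[] = { "Concurrent CTA", "Control divergent", "Coalescing",
                           "L1D miss rate",  "L1I miss rate",     "L1C miss rate",
                           "MSHR",           "Load inst rate",    "Store inst rate",
                           "NoC" };
    double coeff[]    = { 1.414, 444.628, 2057.050, -313.838, 1674.513,
                          -67.277, -102.971, -680.786, -804.7, -8.301 };
    double measured[] = { 8, 0.02, 0.01, 0.30, 0.002, 0.05,
                          0.10, 0.15, 0.05, 2.0 };   // hypothetical inputs

    double sum = -73.635;                         // constant term from Table 2
    for (int i = 0; i < 10; ++i) {
        double impact = coeff[i] * measured[i];   // per-metric impact magnitude
        printf("%-18s %+10.3f\n", name[i], impact);
        sum += impact;
    }
    // Positive sum -> fuse SMs (scale up); negative -> stay scaled out.
    printf("sum = %+.3f -> %s\n", sum, sum > 0 ? "scale up (fuse)" : "scale out");
    return 0;
}
```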
Figure 20: Magnitude of parameter impact on determining scalability for some applications using the proposed predictor.

We now compare the performance of AMOEBA against Dynamic Warp Subdivision (DWS) [33]. The results are plotted in Figure 21. DWS was proposed by Meng et al. to divide warps into smaller ones in order to reduce the stalls caused by memory and control divergence. On average, AMOEBA achieves a 27% performance gain over DWS. The benchmark SM achieves a 3.97x improvement in performance compared to DWS. This is because DWS can only improve resource utilization inside an SM and cannot harness the benefits of cross-SM resource sharing. In contrast, AMOEBA can dynamically change the configuration of its SMs and thus flexibly allows resources to be shared among SMs. Thus, performance can be further improved through enhanced resource utilization.
There are two types of controllers in the proposed architecture: the online reconfiguration controller for scale-up or scale-out, and the switch controller for dynamic fusing and splitting. We propose to implement these controllers in an IP module on the GPU chip. The major components of the controllers are a MAC unit, buffers, and control logic. We employ methods similar to those proposed in [32] to model the buffers in the controllers, using the area of a latch cell from the NanGate 45 nm Open Cell library. The resulting area of each bit of the buffer is 4.2 µm², and the estimated added buffer area is 0.021 mm² per SM. We use a pipelined Booth-Wallace MAC [41], synthesized with the Synopsys Design Compiler at 90 nm technology and scaled to 45 nm. The area of the MAC is 0.019 mm². Together with the control logic, we estimate the two controllers to have an area of 1.52 mm². For a GeForce 8800 GTX, which has 128 SM cores, the overall area overhead of AMOEBA can be calculated as the total SM area overhead plus the controller overhead: 0.021 mm² × 128 + 1.52 mm² = 4.208 mm². Compared to the total GeForce 8800 GTX area of 480 mm², AMOEBA incurs an area overhead of 0.88%.

There has been plenty of work proposing reconfigurable architectures for multi-core CPU systems [25, 26, 27, 24, 42]. A multicore architecture is proposed in [25] that reconfigures cores into a wide VLIW machine to exploit hybrid forms of parallelism. As a pioneering reconfigurable architecture, TRIPS [26] splits ultra-large cores into small ones to meet the diverse parallelism demands of applications. Working in the opposite direction, Ipek et al. [24] proposed Core Fusion, where a large core can be dynamically configured from a group of independent smaller cores. Core Fusion is the work most closely related to AMOEBA, but it was proposed for CPU cores, and its core fusing policy and microarchitecture are very different from our work.
Figure 21: Comparison with Dynamic Warp Subdivision (DWS) [33].
Compared to CPU-based multicore systems, there have been fewer works on reconfigurable GPU architectures. Voitsechov et al. proposed SGMF, a dataflow architecture using a coarse-grain reconfigurable fabric composed of a grid of interconnected functional units [13]. However, SGMF needs help from the compiler to break CUDA/OpenCL kernels into dataflow graphs, and it integrates the control flow of the original kernel to produce a control-data-flow graph (CDFG). Different from their work, our proposed scheme does not require compiler support. R-GPU is a reconfigurable GPU architecture that aims to reduce the cycles spent on data movement and control instructions and to focus on data computations [14]. It configures GPU cores to create a spatial computing architecture. R-GPU implements reconfiguration at the core level within an SM and does not consider an application's scalability, while our work reconfigures at the SM level, and our reconfiguration decision is based on the NoC, control instructions, and memory access patterns. Dhar et al. proposed fine-grained and coarse-grained reconfiguration of SMs in GPUs in order to reduce the under-utilization of resources and the power consumption [15]. However, their work only reconfigures the datapath inside each SM; our work also reconfigures the memory and NoC of the system, and we further propose heterogeneous SMs to improve performance and power efficiency.

Heterogeneous multicores have emerged as a promising approach for CPU-based systems; they leverage cores with different capabilities and complexities to strike a balance between performance and power [9, 43, 44, 45, 46, 47, 48]. Lukefahr et al. propose composite cores that consist of big and small compute engines [9]. Kumar et al. [44] proposed a heterogeneous multi-core architecture to reduce power dissipation. Hill et al. showed that there is great potential for improving the performance of the serial sections of an application using heterogeneous cores [43]. Our proposed AMOEBA architecture differs from these heterogeneous architectures in that our heterogeneous cores are dynamically configurable, while these earlier works employ fixed core configurations. Our design provides more flexibility in exploring heterogeneous architectures and achieves better resource utilization.

Recently, several approaches have been proposed for improving GPU resource utilization [12, 16, 17, 18, 49, 50]. Wang et al. propose Simultaneous Multikernel (SMK), which exploits the heterogeneity of different kernels [12]. Park et al. proposed GPU Maestro, which performs dynamic resource allocation for efficient utilization of multitasking GPUs [16]. Wang et al. propose application-aware TLP management techniques for a multi-application execution environment in order to make judicious use of shared resources [17]. To improve resource utilization in concurrent kernel execution (CKE), Dai et al. proposed mechanisms to reduce memory stalls [18]. Our proposed work is different from these prior techniques because it reconfigures SMs so that they scale according to the application's dynamic behavior.
In this work, we propose a reconfigurable GPU architecture, called AMOEBA, to explore the design space of GPU scaling. By predicting a given application's scalability with SM size, the proposed architecture is able to dynamically configure scale-up or scale-out SMs in order to achieve high performance and resource utilization. We also propose an optimization strategy to further reconfigure each SM based on the warp divergence observed at run-time, resulting in a heterogeneous architecture in which both scale-up and scale-out SMs co-exist. Our evaluation results using various benchmark programs demonstrate the effectiveness of AMOEBA in reducing GPU resource under-utilization and improving system performance and power efficiency.

REFERENCES
[1] Green500 list.
[2] Top500 list.
[3] Amazon Web Service. https://aws.amazon.com/ec2.
[4] D. Luebke and G. Humphreys, "How GPUs work," in Computer, vol. 40, no. 2, Feb. 2007.
[5] A. Prakash, H. Amrouch, M. Shafique, T. Mitra, and J. Henkel, "Improving mobile gaming performance through cooperative CPU-GPU thermal management," in Proceedings of the 53rd ACM/EDAC/IEEE Design Automation Conference (DAC), 2016.
[6] NVIDIA. Programming Guide, 2014.
[7] V. Narasiman, M. Shebanow, C. J. Lee, R. Miftakhutdinov, O. Mutlu, and Y. N. Patt, "Improving GPU performance via large warps and two-level warp scheduling," in Proceedings of the 44th Annual International Symposium on Microarchitecture, 2011.
[8] J. J. K. Park, Y. Park, and S. Mahlke, "ELF: Maximizing memory level parallelism for GPUs with coordinated warp and fetch scheduling," in Proceedings of SC15, 2015.
[9] A. Lukefahr, S. Padmanabha, R. Das, F. M. Sleiman, R. Dreslinski, T. F. Wenisch, and S. Mahlke, "Composite cores: Pushing heterogeneity into a core," in Proceedings of the 45th Annual International Symposium on Microarchitecture, 2012.
[10] R. Saleh, S. Wilton, S. Mirabbasi, A. Hu, M. Greenstreet, G. Lemieux, P. P. Pande, C. Grecu, and A. Ivanov, "System-on-chip: Reuse and integration," in Proceedings of the IEEE, vol. 94, no. 6, June 2006.
[11] A. Bakhoda, J. Kim, and T. M. Aamodt, "Throughput-effective on-chip networks for manycore accelerators," in Proceedings of the 43rd Annual IEEE/ACM International Symposium on Microarchitecture, 2010.
[12] Z. Wang, J. Yang, R. Melhem, B. Childers, Y. Zhang, and M. Guo, "Simultaneous multikernel GPU: Multi-tasking throughput processors via fine-grained sharing," in Proceedings of the IEEE International Symposium on High Performance Computer Architecture (HPCA), 2016.
[13] D. Voitsechov and Y. Etsion, "Single-graph multiple flows: Energy efficient design alternative for GPGPUs," in Proceedings of the 41st International Symposium on Computer Architecture (ISCA), 2014.
[14] G. V. D. Braak and H. Corporaal, "R-GPU: A reconfigurable GPU architecture," in ACM Transactions on Architecture and Code Optimization, vol. 0, no. 0, article 0, 2015.
[15] A. Dhar, "The case for reconfigurable general purpose GPU computing," Master's thesis, University of Illinois at Urbana-Champaign, 2014.
[16] J. Park, Y. Park, and S. Mahlke, "Dynamic resource management for efficient utilization of multitasking GPUs," in Proceedings of ASPLOS, 2017.
[17] H. Wang, F. Luo, M. Ibrahim, O. Kayiran, and A. Jog, "Efficient and fair multi-programming in GPUs via effective bandwidth management," in Proceedings of the IEEE International Symposium on High Performance Computer Architecture (HPCA), 2018.
[18] H. Dai, Z. Lin, C. Li, C. Zhao, F. Wang, N. Zheng, and H. Zhou, "Accelerate GPU concurrent kernel execution by mitigating memory pipeline stalls," in Proceedings of the IEEE International Symposium on High Performance Computer Architecture (HPCA), 2018.
[19] C. Basaran and K. D. Kang, "Supporting preemptive task executions and memory copies in GPGPUs," in Proceedings of the Euromicro Conference on Real-Time Systems, 2012.
[20] S. Kato, K. Lakshmanan, R. Rajkumar, and Y. Ishikawa, "TimeGraph: GPU scheduling for real-time multi-tasking environments," in Proceedings of the 2011 USENIX Annual Technical Conference, 2011.
[21] J. T. Adriaens, K. Compton, N. S. Kim, and M. J. Schulte, "The case for GPGPU spatial multitasking," in Proceedings of the 18th HPCA, 2012.
[22] C. J. Rossbach, J. Currey, M. Silberstein, B. Ray, and E. Witchel, "PTask: Operating system abstractions to manage GPUs as compute devices," in Proceedings of the 23rd ACM Symposium on Operating Systems Principles, 2011.
[23] Q. Xu, H. Jeon, K. Kim, W. W. Ro, and M. Annavaram, "Warped-slicer: Efficient intra-SM slicing through dynamic resource partitioning for GPU multiprogramming," in Proceedings of the 43rd Annual International Symposium on Computer Architecture, 2016.
[24] E. Ipek, M. Kirman, N. Kirman, and J. F. Martinez, "Core Fusion: Accommodating software diversity in chip multiprocessors," in Proceedings of the International Symposium on Computer Architecture (ISCA), 2007.
[25] S. A. Lieberman and S. A. Mahlke, "Extending multicore architectures to exploit hybrid parallelism in single-thread applications," in Proceedings of HPCA, 2007.
[26] K. Sankaralingam, R. Nagarajan, H. Liu, C. Kim, J. Huh, D. Burger, S. W. Keckler, and C. R. Moore, "Exploiting ILP, TLP and DLP with the polymorphous TRIPS architecture," in Proceedings of the International Symposium on Computer Architecture, 2003.
[27] K. Mai, T. Paaske, N. Jayasena, R. Ho, W. J. Dally, and M. Horowitz, "Smart Memories: A modular reconfigurable architecture," in Proceedings of the International Symposium on Computer Architecture.
[29] NVIDIA Volta architecture whitepaper. https://images.nvidia.com/content/volta-architecture/pdf/volta-architecture-whitepaper.pdf.
[30] W. W. L. Fung, I. Sham, G. Yuan, and T. M. Aamodt, "Dynamic warp formation and scheduling for efficient GPU control flow," in Proceedings of the 40th International Symposium on Microarchitecture, 2007.
[31] T. D. Han and T. S. Abdelrahman, "Reducing branch divergence in GPU programs," in Proceedings of the GPGPU-4 Workshop, 2011.
[32] T. Rogers, D. R. Johnson, M. O'Connor, and S. W. Keckler, "A variable warp size architecture," in Proceedings of ISCA, 2015.
[33] J. Meng, D. Tarjan, and K. Skadron, "Dynamic warp subdivision for integrated branch and memory divergence tolerance," in Proceedings of ISCA, 2010.
[34] A. Jadidi, M. Arjomand, M. Kandemir, and C. Das, "Optimizing energy consumption in GPUs through feedback-driven CTA scheduling," in Proceedings of SpringSim (HPC), 2017.
[35] GPGPU-Sim v3.2.2 (2016) GTX 480 Configuration. https://github.com/chenxuhao/gpgpu-sim-ndp/tree/master/configs/GTX480.
[36] A. Bakhoda, G. L. Yuan, W. W. L. Fung, H. Wong, and T. M. Aamodt, "Analyzing CUDA workloads using a detailed GPU simulator," in IEEE International Symposium on Performance Analysis of Systems and Software, 2009.
[37] A. Bakhoda, G. L. Yuan, W. W. L. Fung, H. Wong, and T. M. Aamodt, "Analyzing CUDA workloads using a detailed GPU simulator," in IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), 2009.
[38] S. Che, M. Boyer, J. Meng, D. Tarjan, J. W. Sheaffer, S. Lee, and K. Skadron, "Rodinia: A benchmark suite for heterogeneous computing," in IEEE International Symposium on Workload Characterization (IISWC), 2009.
[39] S. Grauer-Gray, L. Xu, R. Searles, S. Ayalasomayajula, and J. Cavazos, "Auto-tuning a high-level language targeted to GPU codes," in Innovative Parallel Computing (InPar), 2012.
[40] B. He, W. Fang, Q. Luo, N. K. Govindaraju, and T. Wang, "Mars: A MapReduce framework on graphics processors," in International Conference on Parallel Architectures and Compilation Techniques (PACT), 2008.
[41] N. Kumar, M. Bansal, and A. Kaur, "Speed, power and area efficient VLSI architectures of multiplier and accumulator," in International Journal of Scientific and Engineering Research, vol. 4, issue 1, January 2013.
[42] C. Kim, S. Sethumadhavan, M. S. Govindan, N. Ranganathan, D. Gulati, D. Burger, and S. W. Keckler, "Composable lightweight processors," in Proceedings of the International Symposium on Microarchitecture, 2007.
[43] M. Hill and M. Marty, "Amdahl's law in the multicore era," in IEEE Computer, 41(7), 2008.
[44] R. Kumar, K. I. Farkas, N. P. Jouppi, P. Ranganathan, and D. M. Tullsen, "Single-ISA heterogeneous multi-core architectures: The potential for processor power reduction," in Proceedings of the International Symposium on Microarchitecture, 2003.
[45] P. Greenhalgh, "big.LITTLE processing with ARM Cortex-A15 & Cortex-A7," 2011.
[46] M. Annavaram, E. Grochowski, and J. Shen, "Mitigating Amdahl's law through EPI throttling," in Proceedings of the 32nd International Symposium on Computer Architecture, 2005.
[47] R. Balakrishnan, R. Rajwar, M. Upton, and K. Lai, "The impact of performance asymmetry in emerging multicore architectures," in Proceedings of the 32nd International Symposium on Computer Architecture, 2005.
[48] M. A. Suleman, O. Mutlu, M. K. Qureshi, and Y. N. Patt, "Accelerating critical section execution with asymmetric multi-core architectures," in Proceedings of ASPLOS, 2009.
[49] Y. Oh, G. Koo, M. Annavaram, and W. W. Ro, "Linebacker: Preserving victim cache lines in idle register files of GPUs," in ISCA, 2019.
[50] A. Pattnaik, X. Tang, O. Kayiran, A. Jog, A. Mishra, M. T. Kandemir, A. Sivasubramaniam, and C. R. Das, "Opportunistic computing in GPU architectures," in Proceedings of the 46th International Symposium on Computer Architecture, 2019.