A GPU Register File using Static Data Compression
Alexandra Angerd, Erik Sintorn, and Per Stenström
Department of Computer Science and Engineering
Chalmers University of Technology
Göteborg, Sweden
{angerd,erik.sintorn,per.stenstrom}@chalmers.se

ABSTRACT
GPUs rely on large register files to unlock thread-level parallelism for high throughput. Unfortunately, large register files are power hungry, making it important to seek new approaches to improve their utilization. This paper introduces a new register file organization for efficient register-packing of narrow integer and floating-point operands, designed to leverage advances in static analysis. We show that the hardware/software co-designed register file organization yields a performance improvement of up to 79%, and 18.6% on average, at a modest output-quality degradation.
1 INTRODUCTION

Modern GPUs provide high throughput by enabling massive thread-level parallelism (TLP). Large register files are needed to provide fast context-switching between threads, and GPUs rely on ever larger register files in order to further increase TLP [15, 25]. The Fermi architecture, a few generations back, had a register file size of about 131 KB per streaming multiprocessor (SM), consuming 13.4% of the total dynamic power [13]. With 15 SMs, this sums up to about 2 MB of register storage in total. In contemporary architectures, such as NVIDIA's Turing, the total register file size is approximately 18 MB [18]; a ninefold increase!

TLP can be increased by leveraging a more efficient utilization of the register resources to decrease the per-thread register footprint. In a conventional register file, operands are stored at a granularity of 32 bits. However, it has been shown that many operands in GPU workloads require significantly less space. For example, a large portion of the integer operands is narrow [7, 25]. Furthermore, in the case of floating-point operands, their precision can be significantly reduced with a negligible impact on quality [22], especially if the operands are allowed to have different precisions [1].

Previous work [2, 7, 25] proposes register-file organizations which pack operands by establishing the number of significant bits, i.e. the bitwidth, at run time, using zero/one-detection logic to remove redundant sign-extension bits. However, this approach does not work for floating-point data. Still, it has been shown that the precision of many floating-point operands can be reduced substantially [1] with negligible impact on the quality of the application output. In addition, the precision of a floating-point operand cannot be decided at run time, since it is not possible to know what impact reducing the precision of a specific operand has on the end result.
Conference’17, Washington, DC, USA
In this paper, we propose, for the first time, a register file organization capable of storing operands at a fine granularity regardless of the data type. Our evaluation shows that in order to achieve a significant reduction of the register footprint, both integer and floating-point data have to be considered. To the best of our knowledge, our approach is the first to support mixed packing of both integers and floating-point numbers. We do this by annotating the bitwidth needed by each operand at the instruction level, at compile time. To detect narrow integers, we leverage static range analysis [21]. To determine the precision of each float, we leverage a method that tunes the precision to meet a user-defined output quality threshold [1]. The annotations are taken into consideration to achieve a dense register allocation. At run time, our proposed register file uses a configurable indirection table to store the location of each operand.

Our approach is inspired by the idea of a configurable indirection table given by Angerd et al. [1]. They assume the existence of a register file with support for low-precision floats. However, they do not consider any support for narrow integers. Furthermore, they do not present any microarchitectural design of such a register file, nor how it could be integrated into a GPU pipeline, not even for floats. Hence, how to design a compiler-assisted register file which supports both integer and floating-point data remains unsolved. In particular, a microarchitectural design of such a register file organization implies a number of challenges, which we address in this paper: First, the indirection table needs to be consulted for each and every register access. Since this access is on the critical path, it introduces latency which can have an adverse impact on performance. Second, the indirection table must be able to handle multiple accesses per cycle, since several register accesses have to be carried out simultaneously.
Third, conversions between different floating-point formats are necessary. We mitigate these challenges by presenting an indirection table microarchitecture which matches the throughput of the register file, as well as conversion units capable of carrying out floating-point format changes in one cycle. Our evaluation shows that our approach yields a performance improvement of up to 79%, and 18.6% on average, compared to a register file which only supports an operand granularity of 32 bits.

Our contributions in this paper are the following:

• A new register file organization for GPUs which, unlike previously proposed solutions, is capable of supporting narrow operands regardless of data type (integer and floating-point data).
• A new concept for efficient packing of narrow operands, which is built upon static bitwidth analysis co-designed with a new register file concept.
• An evaluation of the proposed microarchitectural design, which shows a performance improvement of up to 79%, and 18.6% on average, when allowing for a slight output quality loss.

Table 1: Register pressure, occupancy, and IPC of the baseline, when using either one or both parts of the framework, and when artificially increasing the occupancy.

                                Register Pressure   Occupancy   IPC
Original                        52                  21%         196
Narrow integers                 46                  -           -
Narrow floats                   36                  -           -
Narrow integers + floats        29                  62.5%       352
Artificial occupancy increase   52                  62.5%       377

The rest of the paper is organized as follows. Section 2 provides a motivational example. Section 3 introduces the baseline GPU architecture and our proposed register file organization. Section 4 presents the static approach we use to reduce operand bitwidths. Section 5 describes the methodology used to derive the results in Section 6. We discuss the implications on other architectures in Section 7. Finally, we put our work in the context of related work in Section 8 before we conclude in Section 9.
2 MOTIVATION

In this section, we show the performance improvement obtained by increasing TLP through improved register file utilization for a kernel called IMGVF, included in the Leukocyte application from the Rodinia benchmark suite [4]. The core idea is to reduce the register pressure, that is, the maximum number of fixed-size registers needed by a thread, by reducing the bitwidth of the operands. This way, the register file can accommodate more threads. For IMGVF, the original register pressure is 52 registers.

The kernel's register pressure directly affects how many threads are allocated to each GPU core. In the CUDA programming model, warps consist of 32 threads which are further bundled into blocks, whose size is kernel-specific and decided by the programmer. The assignment of threads to a core is done at the granularity of blocks. A Fermi GPU has 32,768 registers per core, and supports up to 48 simultaneously active warps. However, the block size for IMGVF is ten warps, with a register pressure of (52 x 32 x 10 =) 16,640 registers. Hence, only one block fits in the register file; the register usage severely limits the achievable TLP. We refer to this limit as occupancy, i.e. the ratio of active warps to the maximum number of warps supported by the core.

To reduce the register pressure and increase the occupancy, we run the application through the static analysis framework (described in detail in Section 4). It consists of two parts: 1) a static analysis, based on [21], that finds the required number of bits for each integer operand, and 2) an analysis, based on [1], that establishes the precision for each floating-point operand given a user-specified quality metric and threshold. Here, we have specified the quality metric and threshold such that no deviation from the original output is allowed. Table 1 reports the register pressure of the original kernel, when each framework is applied in isolation, and when both frameworks are used.
When both frameworks are used, the register pressure is lowered from 52 to 29 registers. As a result, three blocks (30 warps) can now fit into the register file simultaneously, as opposed to one block in the original case. Hence, the occupancy is increased from 21% to 62.5%.

Figure 1: Baseline operand collector design with proposed extensions (in green).

To confirm that an increase in occupancy would unlock more TLP and higher performance, we run the application on a simulated Fermi GPU, using GPGPU-Sim [3] (details are provided in Section 5), both with the original occupancy and with the occupancy reached with our compression technique (emulated by increasing the size of the simulated register file). The result is presented in Table 1; with a higher occupancy, the Instructions Per Clock (IPC) is increased by 91%.

We also modify GPGPU-Sim to include the microarchitectural structures needed to implement the proposed register file using the indirection table approach described in Section 3. As seen in Table 1, the increase in IPC is close to what can be achieved by artificially allowing for higher occupancy.
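The occupancy arithmetic above can be sketched in a few lines of code. This is a toy calculation for illustration only, not part of the proposed framework; the parameters are the Fermi figures quoted in the text:

```python
def occupancy(regs_per_thread, warps_per_block,
              regfile_size=32768, max_warps=48, threads_per_warp=32):
    """Blocks that fit in one SM's register file, and the resulting occupancy."""
    regs_per_block = regs_per_thread * threads_per_warp * warps_per_block
    blocks = regfile_size // regs_per_block        # only whole blocks are assigned
    active_warps = min(blocks * warps_per_block, max_warps)
    return blocks, active_warps / max_warps

# IMGVF: 52 registers per thread originally vs. 29 after compression
print(occupancy(52, 10))   # one block, ~21% occupancy
print(occupancy(29, 10))   # three blocks, 62.5% occupancy
```

Note that the result is quantized at block granularity: shaving a few registers off the pressure only pays off once another whole block fits.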
3 REGISTER FILE ORGANIZATION

In this section, we first describe the baseline GPU in Section 3.1. Then, in Section 3.2, we describe the microarchitectural implementation of the new register file organization.
3.1 Baseline Architecture

Our baseline microarchitecture resembles NVIDIA's Fermi architecture [16], used in several recent studies [12, 13]. The threads in each warp are executed in lockstep, but with different register values, in a SIMD fashion. Hence, threads are scheduled to execution units at warp granularity.

The GPU has 15 cores called Streaming Multiprocessors (SMs), which share an L2 cache. Each SM has a private L1 cache, a local shared memory, and a dedicated texture cache. Each SM also has two warp schedulers, which together can schedule two instructions from different warps simultaneously. One SM also comprises two Single Precision Units (SPUs), one Special Function Unit (SFU), and one memory (LD/ST) unit. All units execute at the granularity of one warp instruction (32 lock-stepped thread instructions). The SPUs execute all types of instructions except for built-in trigonometric and logarithmic operations, which are executed using the SFUs. The LD/ST unit carries out memory operations.

To hide idle cycles caused by hazards and memory access latency, the SM keeps a large number of warps in flight, supporting fast context switches by a large register file.

Figure 2: Each thread register is divided into eight slices. These are accessed through an indirection table.

To provide a large bandwidth and the appearance of being multi-ported, the baseline register file uses an operand collector (see Figure 1). The register file is split into 16 banks, each with 64 entries, 1024 bits wide, with one read port and one write port per bank. Because each warp is executed in lockstep, the registers are stored in vectors of 32 thread registers, forming a warp register. A register file access applies to a warp register. To maximize throughput, an arbitrator is associated with the operand collector to distribute the requests from all collector units (CUs) so as to maximize register-bank accesses in each cycle.

A warp instruction is allocated to one of the CUs in the operand collector. The valid flags in the CU are set, indicating which operands to fetch from the register file. The operands are then queued at the arbitrator. Since the arbitrator is optimized for throughput, and not for individual warp latency, only one operand can be collected by each CU in each cycle. Hence, it may take a few cycles before all operands for a certain warp instruction are collected. When all operands for a warp instruction are collected, i.e., when all ready flags in one CU are set, the warp instruction is ready to be forwarded to the execution units.
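A minimal sketch of this arbitration policy follows. It is our own greedy simplification for illustration, not the actual arbiter logic: each cycle, every CU receives at most one operand and every bank serves at most one request.

```python
def arbitrate(pending):
    """One arbitration cycle. `pending` maps CU id -> list of bank ids the
    CU still needs. Each CU is granted at most one operand, and each bank
    serves at most one access; granted requests are removed from `pending`."""
    busy_banks, grants = set(), {}
    for cu, wants in pending.items():
        for bank in wants:
            if bank not in busy_banks:   # bank still free this cycle
                busy_banks.add(bank)
                grants[cu] = bank
                wants.remove(bank)
                break                    # one operand per CU per cycle
    return grants

# CUs 0 and 1 both need bank 3; only one of them is served this cycle,
# so CU 1 must wait at least one more cycle for its last operand.
pending = {0: [3, 5], 1: [3]}
print(arbitrate(pending))
```

A real arbiter would also weigh fairness and port conflicts; the sketch only shows why collecting all operands of a warp instruction can take several cycles.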
3.2 Microarchitectural Implementation

To store operands at a fine granularity, each thread register is divided into slices (4 bits each, to efficiently support the floating-point formats described in Section 5.2), with an operand comprising data contained in multiple slices (see Figure 2). An indirection table points out in which registers, and in which slices, the operand is stored. The allocation of slices to each operand is static for each kernel (to be described in Section 4), so the configuration of the indirection table is different for each kernel. Changes to the baseline register file are confined to the operand collector, as the indirection table keeps all information needed to access individual registers.
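With 4-bit slices, the number of slices an operand occupies is simply its annotated bitwidth rounded up to the next nibble. A tiny helper (our own, for illustration):

```python
SLICE_BITS = 4  # each 32-bit thread register holds eight 4-bit slices

def slices_needed(bitwidth):
    """Slices consumed by an operand annotated with `bitwidth` bits."""
    return -(-bitwidth // SLICE_BITS)   # ceiling division

print(slices_needed(6))    # 6-bit integer  -> 2 slices
print(slices_needed(16))   # 16-bit float   -> 4 slices
print(slices_needed(32))   # full operand   -> 8 slices
```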
The changes to the operand collector comprise the green blocks in Figure 1. When a register read instruction is allocated to a CU, it collects the location information for each operand from the Source indirection table. The operand is then queued to the arbitrator as before. The data returned from the register file is compressed, i.e., only some slices of the returned data contain valid data. The Value Extractor rearranges the slices such that the data is properly aligned, and the aligned data is returned to the operand collector. It also sign-extends the operand if it is an integer. If the operand is a narrow float, however, it needs to be extended to single precision by the Value Converter before being forwarded to the execution units.

Figure 3: Example: a 16-bit float is split into two separate registers. After a fetch, the TVEs extract and align the data.

Register write instructions collect location information from the Destination indirection table. However, bank conflicts might occur in the indirection table, so a buffer temporarily storing conflicting operands is added. As the probability of a conflict is small (the writeback bus is three operands wide, and the number of banks in the indirection table is 16), the number of buffer entries is negligible; one entry corresponds to one warp register, which is on the order of 0.1% of the size of the register file. To store a float operand which takes up less than 32 bits, it is converted to lower precision. Then, the operand is aligned to its corresponding placement inside the physical register using a Value Truncator.

Since the source indirection table is on the critical path, it is vital that it has the same throughput as the register file. To guarantee this, the organization of the indirection table is similar to that of the operand collector: the SRAM cells are divided into 16 banks. A separate arbitrator distributes read requests in such a way that as many banks as possible are accessed each cycle. We assume 256 architectural registers, where the indirection table has to store 32 bits for each of them. A detailed area analysis of the indirection table and the other structures in the proposed register file is provided in Section 6.4. To avoid contention, we introduce separate, yet identical, indirection tables for the read and the write paths.

When the content of a physical register is read from the register file, it is expected that only a few slices contain the required operand. These slices need to be extracted from the rest of the data. As shown in the example scenario of Figure 3, the data of a 16-bit float operand is placed within two separate physical registers: data slice 0 is placed in slice 7 of physical register r0, while data slices 1, 2, and 3 are placed in slices 2, 3, and 6 of physical register r1. To restore the data, the thread value extractor (TVE) rearranges the slices, and sets unused slices to zero. Later, when both physical registers are fetched, the two parts are merged into a complete operand using an OR operation.

As Figure 4 shows, each value extractor consists of 32 parallel Thread Value Extractors (TVEs). Each of them carries out the extraction for one thread register, and consists of eight 9-to-1 multiplexers and one 2-to-1 multiplexer. The mask, together with some logic gates connected to the multiplexer select lines, decides the placement of the input slices, zeros, and ones in the output. The 2-to-1 multiplexer decides whether the value should be padded with zeros or ones, depending on the type of operand. A float or unsigned integer should always be padded with zeros, while a signed integer is simply sign-extended.

Figure 4: The value extractor includes 32 TVEs. Each TVE includes eight 9:1 multiplexers, which select among the input slices and a nibble of zeros or ones.
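A bit-level sketch of what one TVE does, using the Figure 3 layout, may help. The register encoding and helper names below are our own illustrative model, not the hardware description:

```python
SLICE = 4  # bits per slice

def get_slice(reg, i):
    """Nibble i (0 = least significant) of a 32-bit thread register."""
    return (reg >> (SLICE * i)) & 0xF

def extract(regs, placement, signed=False, width=None):
    """Reassemble an operand whose data slice d is stored at
    placement[d] = (physical register index, slice index). Unused slices
    are implicitly zero; narrow signed integers are sign-extended."""
    value = 0
    for d, (r, s) in enumerate(placement):
        value |= get_slice(regs[r], s) << (SLICE * d)
    if signed and width and (value >> (width - 1)) & 1:
        value -= 1 << width          # two's-complement sign extension
    return value

# Figure 3: data slice 0 lives in slice 7 of r0; slices 1-3 in slices 2, 3, 6 of r1
r0 = 0xA << (SLICE * 7)
r1 = (0xB << (SLICE * 2)) | (0xC << (SLICE * 3)) | (0xD << (SLICE * 6))
print(hex(extract([r0, r1], [(0, 7), (1, 2), (1, 3), (1, 6)])))  # -> 0xdcba
```

The OR-accumulation over `placement` mirrors the hardware's merge of the two partial fetches: each physical register contributes only its own slices, and the rest of the word is zero.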
The CU is extended with four fields per operand, as shown in the lower-right part of Figure 1. The first is a bit which indicates whether the operand is signed or not. The second is a convert info flag, which indicates whether the operand is a float which needs to be converted. The third is a location flag, which indicates whether the operand location has been fetched. The fourth is an indirection info field, which is filled by an access to the indirection table.

The CU is also extended with a 1024-bit OR gate, which is used if the operand is split into two different physical registers. The first part that is fetched is simply placed into the operand field. When the second part arrives, it is OR'ed with the data in the operand field to form a complete operand.
The Value Converter (VC) extends low-precision float operands to single-precision floats. Since two instructions can be scheduled in each cycle, and each instruction has up to three source operands, up to six conversions need to be carried out in each cycle to maintain the maximum throughput. Hence, the VC consists of six parallel Warp Value Converters. Each of these, in turn, consists of 32 parallel Thread Value Converters. The low-precision format we use mimics the IEEE 754 standard, with support for plus/minus infinity and not-a-number values. During format conversion, denormals are truncated to zero, which is safe as the same simplification is made in the precision selection step described in Section 4.1.

Figure 5: The value truncator converts the operand to lower precision and places the data into its assigned slices.
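The conversion a Thread Value Converter performs can be modelled in software. The sketch below assumes the Table 3 formats (IEEE-754-like, with bias 2^(e-1)-1) and, as stated above, flushes denormals to zero; it is our illustrative model, not the synthesized hardware:

```python
def narrow_to_float(bits, exp_bits, man_bits):
    """Decode a narrow IEEE-754-like float (1 sign bit, exp_bits, man_bits)."""
    sign = -1.0 if (bits >> (exp_bits + man_bits)) & 1 else 1.0
    exp = (bits >> man_bits) & ((1 << exp_bits) - 1)
    man = bits & ((1 << man_bits) - 1)
    if exp == 0:
        return sign * 0.0                        # denormals truncated to zero
    if exp == (1 << exp_bits) - 1:
        return float('nan') if man else sign * float('inf')
    bias = (1 << (exp_bits - 1)) - 1
    return sign * (1 + man / (1 << man_bits)) * 2.0 ** (exp - bias)

# 16-bit format from Table 3 (5 exponent, 10 mantissa bits): 0x3C00 encodes 1.0
print(narrow_to_float(0x3C00, 5, 10))   # -> 1.0
print(narrow_to_float(0x4248, 5, 10))   # -> 3.140625 (half-precision pi)
```

In hardware, the same expansion amounts to re-biasing the exponent and left-aligning the mantissa, which is why it fits in a single cycle.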
Figure 6: Proposed extension of the pipeline (in green).
Before an operand with a narrow bitwidth is written back into the register file, it has to be adjusted to its assigned slices. This is carried out by the Value Truncator depicted in Figure 5, which comprises three Warp Value Truncators (WVTs). This is because we assume that the writeback bus is three instructions wide, as modelled by GPGPU-Sim [3]. Similar to the Warp Value Converter, each WVT consists of 32 smaller units called Thread Value Truncators (TVTs). Each TVT carries out the required steps before the operand can be written back. In Step 1, if the operand is a narrow float, it is converted to lower precision. If not, this step is skipped. In Step 2, the data is placed within its corresponding register slices. This procedure is the same as in Figure 4, but with another set of logic for the select lines. Since an operand can be split and placed into two physical locations, two thread value extractors are needed. In the last step, the TVTs forward the compressed data together with the masks to the register file. At writeback, only the bit lines corresponding to the mask are activated, so as to not overwrite the data in the other slices.
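Step 2 of the TVT — scattering the operand's nibbles into their assigned slices and producing the write mask — is the inverse of the extraction performed by the TVEs. A software sketch of our own, reusing the Figure 3 layout:

```python
SLICE = 4  # bits per slice

def pack(value, placement, reg_count=2):
    """Scatter the data slices of `value` into their assigned physical
    slices. Returns one (data, slice-mask) pair per physical register;
    at writeback, only the masked slices' bit lines would be driven."""
    data = [0] * reg_count
    mask = [0] * reg_count
    for d, (r, s) in enumerate(placement):
        nibble = (value >> (SLICE * d)) & 0xF
        data[r] |= nibble << (SLICE * s)
        mask[r] |= 1 << s
    return list(zip(data, mask))

# Figure 3 layout: slice 0 -> r0 slice 7, slices 1-3 -> r1 slices 2, 3, 6
for d, m in pack(0xDCBA, [(0, 7), (1, 2), (1, 3), (1, 6)]):
    print(hex(d), format(m, '08b'))
```

The per-register slice mask is exactly what lets the write leave the neighboring operands' slices untouched.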
To maintain the baseline clock speed, we modify the pipeline according to Figure 6, where the stages marked in green are added to those of the baseline, marked in white. In the unmodified pipeline, the operand collector is in charge of sending all its operands to the register fetch stage, and does not pass the instruction to the execution stage before all of its operands are collected. In our modified pipeline, the operand collector is also responsible for synchronizing the accesses to the source indirection table, and for sending floats to the value converter. We assume all new stages can be carried out in one clock cycle, as justified in the next section.
The indirection table has the same organization as the register file, so we assume the same timing, with a maximum throughput of 16 accesses per cycle.

We estimate the propagation delay of the value converter using Catapult C together with the NanGate 45 nm Open Cell Library, by synthesizing it to the register transfer level (note: Fermi is implemented in 40 nm). A critical path analysis shows that the delay is well within a Fermi clock cycle (0.71 ns). Since the converter has six parallel units, we assume a throughput of six conversions per cycle.

The value extractor has a shallow critical path of one multiplexer. Therefore, we assume it can be carried out within a register-read cycle, and no additional cycles are added.

At writeback, destination operands are looked up in the destination indirection table and, if necessary, truncated using the value truncator. The destination indirection table is identical to the source indirection table, and consequently we assume the same timing. Furthermore, we assume that the value truncator has a similar propagation delay as the value converter. Hence, the minimum writeback delay is two cycles if conversion is needed, and one cycle otherwise. However, in Fermi, the writeback bus is three operands wide, which means that bank conflicts are possible in the destination indirection table. To account for these, we pessimistically model the additional propagation delay as three cycles for all operands.

Figure 7: Overview of the static framework.
4 STATIC ANALYSIS FRAMEWORK

The static analysis framework (see Figure 7) comprises three steps: a range analysis step (Section 4.2), which identifies and reduces the bitwidth of integer operands; a precision-reduction step (Section 4.1), which tunes the precision of the floating-point operands; and a register allocation step (Section 4.3). The range analysis and precision-reduction steps find and annotate all operands with their needed bitwidths. The register allocation step assigns a suitable number of slices to each operand. Before execution of a kernel, the kernel-specific indirection information is loaded into the indirection table.
4.1 Precision Reduction

To reduce the bitwidth of floating-point operands, we employ a method proposed by Angerd et al. [1]. Their precision-tuning method is a heuristic whose goal is to identify how much the precision of each floating-point value can be reduced while meeting a specified quality threshold. To achieve this, it takes as input an application, a quality threshold, and a number of application sample inputs. It then recursively explores how much the precision of each floating-point value can be reduced while still meeting the quality requirement on the sample inputs. This is carried out at the instruction level, where the instructions are in Static Single Assignment (SSA) form, meaning that each value corresponds to one single value definition. Each SSA register is then annotated with a bitwidth which meets the targeted quality threshold. Obviously, since this approach is data driven, it relies on a domain expert to provide a set of representative sample inputs. No quality guarantees are given for inputs outside of the set.
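The overall shape of such a tuning loop can be sketched as below. This is a simplified greedy variant for illustration only — the method of [1] explores reductions recursively — and `run_samples`/`quality_ok` are placeholders standing in for the real kernel execution and quality check:

```python
WIDTHS = (32, 28, 24, 20, 16, 12, 8)   # candidate total widths (cf. Table 3)

def tune_precision(ssa_values, run_samples, quality_ok):
    """For each SSA float value, keep narrowing its width while the
    sample inputs still meet the quality threshold."""
    assignment = {v: 32 for v in ssa_values}
    for v in ssa_values:
        for w in WIDTHS[1:]:
            trial = dict(assignment, **{v: w})
            if quality_ok(run_samples(trial)):
                assignment[v] = w        # narrower format still acceptable
            else:
                break                    # stop at the first failing width
    return assignment
```

Note that the loop only validates against the provided samples, which is exactly why the resulting annotation carries no guarantee for other inputs.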
4.2 Range Analysis

To detect narrow integers offline, we propose to use static range analysis [21]. Originally, range analysis was used to secure programs against integer overflows. However, we use static range analysis to determine the number of bits needed for each integer operand. This is carried out at the instruction level. The steps taken in the analysis are shown in Figure 8: First, the program is converted into a control flow graph (CFG) which uses a representation called Extended SSA (e-SSA) form. This makes it possible to capture inequalities enforced by control flow dependencies. E.g., the code in Figure 8a is converted into the CFG in Figure 8b, where the first branch produces two versions of variable k: k_t, which is below 50, and k_f, which is greater than or equal to 50. Next, the CFG is fed into the range algorithm [21]. It creates constraints based on the CFG, analyzes them, and outputs a range for each e-SSA register (Figure 8c). Then, we merge the ranges of all e-SSA registers which belong to each original variable by finding the union of their ranges, as shown in Figure 8d. Finally, we determine how many bits are needed to describe this range.

Figure 8: The steps of the static range analysis. (a): Example program. (b): CFG in e-SSA form. (c): Ranges in e-SSA form. (d): Range and required bitwidth of each original variable (k: [0,50], i: [0,50], j: [0,49]; 6 bits each). The example program of (a):

k = 0
while k < 50 {
    i = 0
    j = k
    while i < j {
        print k
        i = i + 1
        k = k + 1
    }
}
print k

4.3 Register Allocation

To allocate registers, we make use of an existing algorithm [1].
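The last step of the range analysis — turning a merged range into a bit count, as in Figure 8d — amounts to the following. The helper is our own, and the signed case assumes two's complement:

```python
def bits_for_range(lo, hi):
    """Bits needed to represent every integer in [lo, hi]."""
    if lo >= 0:                           # unsigned representation suffices
        return max(hi.bit_length(), 1)
    # two's complement: n bits cover [-2**(n-1), 2**(n-1) - 1]
    def need(v):
        return (v.bit_length() if v >= 0 else (v + 1).bit_length()) + 1
    return max(need(lo), need(hi))

# Figure 8: k and i span [0, 50], j spans [0, 49] -> 6 bits each
print(bits_for_range(0, 50), bits_for_range(0, 49))   # -> 6 6
print(bits_for_range(-32, 31))                        # -> 6 (signed)
```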
We extend the algorithm to consider the width annotations from the range analysis as well as the precision-reduction step, and assign a sufficient number of slices to each operand. The output contains information about the location within the register file for each operand (denoted "indirection info" in Figure 7): a register name points out which physical register to access, and an 8-bit mask shows which slices within the physical register are allocated for the architectural register. In addition, to minimize fragmentation, each architectural register can be split into two parts and placed into the slices of two different physical registers, which is why each entry in the indirection table in Figure 7 has two physical registers, r0 and r1, and two masks, m0 and m1.

5 EXPERIMENTAL METHODOLOGY

5.1 Simulation Setup

We evaluate the impact of our proposed design by modifying GPGPU-Sim [3]. Table 2 summarizes the settings, which correspond to the configuration of a Fermi GTX 480 GPU. While relatively old, this baseline is widely used in GPU microarchitecture research (e.g. [2, 9, 11]). The reason is that the basic SM pipeline in contemporary GPUs is similar to the one in Fermi. Therefore, our proposal is also applicable to newer architectures: the thread-to-register ratio has not changed much, and register shortage remains a problem. In Section 7 we give further insight into how our proposal scales to architectures other than Fermi.

Table 2: Summary of GPU parameters.

Parameter (per GPU)   Value                Parameter (per SM)    Value
Clock Frequency       1400 MHz             Warp Schedulers       2
SMs                   15                   Max Warps             48
Scheduling Policy     Greedy then oldest   Thread Registers      32768
L2 cache              786 KB               Register Banks        16
                                           Register Bank Width   1024 bits
                                           Entries / Bank        64
                                           Operand Collectors    16
                                           L1 cache              16 KB
                                           Shared memory         48 KB

GPGPU-Sim simulates NVIDIA's instruction set PTX. We use the framework described in Section 4 to annotate each PTX register with a bitwidth. Then, the register allocator outputs the indirection table contents in the form of register IDs and masks. This information is then uploaded to GPGPU-Sim and consulted before any register access is carried out.

The PTX instruction set is an intermediate representation compiled by ptxas, the proprietary NVIDIA backend compiler, into the target assembly code. Since we carry out annotations and register allocation directly on PTX, our register usage deviates from what ptxas reports. In all cases, our liveness analysis reports slightly more registers than ptxas does, since the PTX assembly code is not fully optimized. Hence, our register usage is an overestimation compared to what is required in the executed assembly code.

5.2 Floating-Point Formats

The IEEE 754 standard defines five floating-point precision formats, of which three are supported by modern GPUs: double, single, and half precision (64, 32, and 16 bits, respectively) [18]. The formats we consider are listed in Table 3. Besides the standard 32- and 16-bit precision formats, the rest are chosen to approximately maintain the single-precision ratio between the exponent and mantissa bits. We choose this scheme because prior research [1] has shown that it outperforms both using only the formats supported by the IEEE 754 standard and mantissa truncation in how efficiently each bit is used. As none of our benchmarks use double precision, we do not consider precision formats larger than 32 bits.

Table 3: Distribution of bits for each considered floating-point format. All configurations also include a sign bit.

Bits, Total     32   28   24   20   16   12    8
Exponent bits    8    7    6    5    5    4    3
Mantissa bits   23   20   17   14   10    7    4
5.3 Benchmarks

Table 4: A summary of the evaluated kernels.

Name         Quality metric   Register usage   Warps       Group
                              per thread       per block
Deferred     SSIM             47               8           1
SSAO         SSIM             28               8           1
Elevated     SSIM             46               8           1
Pathtracer   SSIM             50               8           1
CFD          % deviation      60               6           2
DWT2D        % deviation      38               6           2
Hotspot      % deviation      31               8           2
Hotspot3D    % deviation      42               8           2
IMGVF        % deviation      52               10          2
GICOV        % deviation      24               6           2
Hybridsort   Binary           36               8           3

We evaluate our work using eleven CUDA kernels from various application domains common to GPUs, in which the occupancy is bounded by the register usage. The first four are graphics kernels. Deferred and SSAO are standard passes used in many modern real-time applications. Elevated and Pathtracer are both larger kernels taken from the Shadertoy [20] web site. Elevated generates an image of a fractal landscape through ray marching, using common techniques such as evaluation of fractals and Perlin noise. Pathtracer implements a standard path-tracing algorithm. The other seven kernels are from the Rodinia benchmark suite [4]. They are selected because their occupancy is limited by register pressure, and because they can be run on the simulator.

Table 4 summarizes the kernels, together with their quality metric, their original register usage per thread, and their block size in warps. The graphics kernels all use the Structural Similarity Index (SSIM) [26] to measure quality, which is a well-established metric for comparing the quality of, e.g., compressed images. For Hybridsort, we use a binary quality metric, i.e. the output can be correct or wrong. For the remaining kernels, we use the percentage of deviation from the correct output. Note that, while SSIM is a well-established quality metric for images, the % deviation metric might not always be ideal. The choice of quality metric has a large impact both on the possibility to trade bits for output quality and on how usable the end result is. Ideally, the quality metric should be decided by application domain experts. However, in this paper, we use it to demonstrate the potential of our approach. The metric can easily be replaced by something more appropriate, without any impact on our approach.
We first investigate what impact the static framework has on theregister pressure and occupancy. Second, we examine the perfor-mance impact in terms of instructions per clock (IPC). Third, wecarry out a sensitivity analysis with respect to writeback delay.Fourth, we present an area overhead analysis. F D D W T D H o t s p o t D H y b r i d s o r t G I C O V I M G V F H o t s p o t D e f e rr e d P a t h t r a c e r SS A O E l e v a t e d R e g i s t e r P r e ss u r e OriginalNarrow integers Narrow floats, perfect qualityNarrow floats, high quality Narrow integers + floats, perfect qualityNarrow integers + floats, high quality
Figure 9: Original register pressure and the register pressure when using the static analysis framework for two different quality thresholds.
We consider two output quality thresholds. The first, called perfect quality, allows no quality degradation; we define it as SSIM = 1.0 for Group 1 (see Table 4) and as 0% deviation for Group 2. The metric of Hybridsort is binary and has only two levels: perfect and not acceptable. The second threshold accepts a slight quality loss. We call this high quality and define it as SSIM = 0.9 for Group 1, 10% deviation for Group 2, and perfect for Hybridsort (since its quality metric is binary). Up to 10% quality loss is generally acceptable [14], but note that this threshold should be carefully considered by the domain expert.

Figure 9 presents the impact the static framework has on the register pressure: the y-axis shows the required number of registers per thread, and we present six bars per benchmark. The first bar, from the left, shows the original register pressure. The second bar shows the register pressure if only integers are compressed. The third and fourth bars show the register pressure if only floats are considered for compression, for perfect and high quality, respectively. The fifth and sixth bars show the register pressure if both integers and floats are compressed, for perfect and high quality, respectively.

The framework reduces the register pressure for all benchmarks. Hybridsort, GICOV, and IMGVF show the largest relative reduction, since they respond well to both parts of the framework. While the floating-point reduction framework is responsible for the largest reduction in register pressure overall, for some benchmarks (e.g., DWT2D, Hotspot3D, and Hotspot) the static integer reduction framework is of key importance to achieve a register pressure reduction.

Figure 10 presents the impact of the register pressure reduction on the occupancy. The first bar, from the left, shows the original occupancy.
The second and third bars show the occupancy when using our proposed approach, for a perfect and a high output quality, respectively. Here, the entire framework is used, which means that both integers and floats are reduced. In all cases, the occupancy increases. However, in some cases the decrease in register pressure does not translate into a corresponding increase in occupancy. This is because shared-memory usage can also limit the achievable occupancy. For example, consider the result of IMGVF. When going from perfect to high output quality, the register pressure is reduced from 29 to 24 registers. If register pressure were the only limiting factor, the occupancy would increase to four blocks, since 24 registers × 32 threads × 10 warps × 4 blocks = 30,720 < 32,768 registers. However, no more than 3 blocks fit into the 48 KB shared memory of the SM.

Figure 10: Impact on occupancy for two quality thresholds.

Figure 11: Impact on IPC for two quality thresholds.
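The interplay between the register and shared-memory limits can be captured in a few lines. The sketch below assumes Fermi-like per-SM limits (32,768 32-bit registers and 48 KB of shared memory) and a hypothetical 16 KB shared-memory footprint per IMGVF block; it is a back-of-the-envelope model, not the simulator's occupancy calculation.

```python
def occupancy_blocks(regs_per_thread, warps_per_block, smem_per_block,
                     threads_per_warp=32, regs_per_sm=32768,
                     smem_per_sm=48 * 1024, max_blocks=8):
    """Active blocks per SM: the tightest of the register, shared-memory,
    and hardware block limits (a simplified occupancy model)."""
    regs_per_block = regs_per_thread * threads_per_warp * warps_per_block
    by_regs = regs_per_sm // regs_per_block
    by_smem = smem_per_sm // smem_per_block if smem_per_block else max_blocks
    return min(by_regs, by_smem, max_blocks)
```

With 24 registers per thread, the register limit alone would allow 4 blocks (24 × 32 × 10 × 4 = 30,720 ≤ 32,768), but an assumed 16 KB block footprint caps the occupancy at 3 blocks.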
Figure 11 shows the impact on IPC when using the proposed register file organization, for a perfect and a high output quality. In many cases, the IPC correlates with the increase in occupancy, with a geometric-mean increase of 15.75% and 18.6% for a perfect and a high output quality, respectively. For CFD, DWT2D, Hotspot3D, IMGVF, Deferred, and Pathtracer we see a substantial increase in IPC (between 9% and 79%). However, for some benchmarks the IPC decreases. For GICOV and SSAO, the IPC decrease is due to contention in the texture cache. For a perfect output quality, the miss rate increases from 76% to 86% for GICOV, and from 69% to 73% for SSAO, which hurts performance. For Elevated, there is a decrease in IPC when targeting a perfect output quality, but a slight increase for the high output quality. This is because the new operand collector has a deeper pipeline, which requires more warps to hide its latency. We investigate the relationship between IPC and pipeline stages further in Section 6.3.
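For reference, the average figures above are geometric means of per-benchmark IPC ratios, reported as a percentage increase. A minimal sketch of that aggregation (our formulation, not the paper's scripts):

```python
import math


def geomean_increase_pct(speedups):
    """Geometric mean of IPC ratios (new/old), as a percentage increase."""
    log_mean = sum(math.log(s) for s in speedups) / len(speedups)
    return (math.exp(log_mean) - 1.0) * 100.0
```

For example, a 79% gain on one benchmark combined with no change on another averages to roughly 34%, which illustrates how the geometric mean damps outliers.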
In the evaluation in Section 6.2, we model the writeback delay for each operand as three clock cycles: one cycle for conversion from high to low precision, one cycle for accessing the register file, and one cycle to account for possible indirection-table bank conflicts. This is quite a pessimistic estimate, since not every operand needs to be converted. In addition, the risk of bank conflicts is low, since the writeback pipeline is three operands wide and the number of banks is 16. Nevertheless, the true number of required cycles might be either more or less than three, which motivates a sensitivity analysis of the writeback delay.

Figure 12: Impact on IPC when varying the number of writeback delay cycles.
Figure 12 shows the resulting IPC for all benchmarks when assuming four different writeback delays: 0, 2, 4, and 8 cycles. For most of the benchmarks, the impact is small up to four cycles. The exceptions are Elevated and GICOV, for which the IPC deteriorates significantly at four cycles. The reason is that GPUs do not include forwarding, but rely on a scoreboard that prevents scheduling dependent instructions, resulting in lower IPC. Note that timing anomalies sometimes give non-intuitive increases in IPC for larger writeback delays, e.g., in Deferred.
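The scoreboard effect can be illustrated with a toy latency-hiding model in the spirit of Little's law: a warp whose next instruction depends on a pending writeback can reissue only every delay + 1 cycles, so the scheduler sustains full issue only when enough other warps are eligible. This is our simplification for intuition, not the simulator's pipeline model.

```python
def relative_ipc(num_warps, writeback_delay):
    """Fraction of peak issue rate sustained by a round-robin scheduler
    when every instruction depends on its predecessor (worst case)."""
    return min(1.0, num_warps / (writeback_delay + 1.0))
```

Under this model, 10 eligible warps fully hide a 4-cycle delay (10/5 ≥ 1), while only 3 warps would sustain 60% of peak, consistent with the observation that kernels with few eligible warps suffer most from a longer writeback delay.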
This section uses transistor count as a proxy for the area overhead of the green blocks in Figure 1.

The value extractors are the most transistor hungry. Each thread-level value extractor (TVE) consists of eight 9:1-multiplexers which are 32 bits wide. Assuming each bit of each multiplexer can be implemented with eight 6-transistor AOI cells, the transistor count is 1,536 per TVE. Additionally, each TVE requires one 2:1-multiplexer, four bits wide, which adds 6 × 4 = 24 transistors. Since each warp consists of 32 threads, each warp-level extractor requires about 50K transistors. With one extractor per bank, this sums up to 50K × 16 = 800K transistors in total.

We synthesize the value converter to the register-transfer level. By analyzing the resulting gate network, consisting mainly of an adder and some multiplexers, we estimate the transistor count per thread-level value converter to be approximately 1,300. This sums up to 249,600 transistors for the 6 warp-level value converters. Each indirection table has 256 32-bit entries. Assuming each bit is stored in a 6-transistor SRAM cell, the transistor count for each indirection table is 49,152. Two indirection tables are needed: 98,304 transistors in total.

The number of transistors for the value truncators can be estimated using the value converter and value extractor overheads. Each thread-level value truncator consists of one truncation unit and two thread-level value extractors. We assume that the truncation unit requires roughly the same area as a value converter, since the steps taken are similar. Each thread-level value truncator then requires 1 × 1,300 + 2 × 1,560 = 4,420 transistors, which sums up to 4,420 × 32 × 3 ≈ 424K transistors for the three warp-level value truncators.

Finally, the extension for one collector unit (CU) consists of one 1024-bit-wide OR-gate and additional SRAM cells. Assuming a 6-transistor OR-gate per bit and 35 bits of additional storage per CU, the overhead per CU is 1024 × 6 + 35 × 6 = 6,354 transistors. Summed over all CUs and added to the structures above, the overhead is about 1.8 million transistors per SM, or roughly 27 million transistors in total for the 15 SMs. This is a pessimistic estimate, since no circuit-level optimizations have been considered. Still, it is a very small fraction (less than 1%) of the total transistor budget, which is about 3.1 billion transistors for the GTX 480 chip.
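The bookkeeping above is easy to mis-track, so the per-structure arithmetic is restated below. The constants come from the estimates in this section (16 banks, 6 warp-level converters, 3 warp-level truncators); the truncator composition (one converter-sized unit plus two extractors per thread) is our reading of the text, and the script is simply an audit of those numbers.

```python
THREADS = 32     # threads per warp
BANKS = 16       # register file banks, one warp-level extractor each

# Value extractors: 1,536 T for the multiplexers plus a 4-bit 2:1 mux.
tve = 1536 + 6 * 4                       # 1,560 per thread
warp_extractor = tve * THREADS           # 49,920 (~50K)
extractors = warp_extractor * BANKS      # ~800K in total

# Value converters: ~1,300 T per thread, 6 warp-level instances.
converters = 1300 * THREADS * 6          # 249,600

# Indirection tables: 256 x 32-bit entries in 6T SRAM cells, two tables.
tables = 2 * 256 * 32 * 6                # 98,304

# Value truncators: one converter-sized unit plus two extractors per
# thread, 3 warp-level instances.
truncators = (1300 + 2 * tve) * THREADS * 3   # ~424K
```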
We estimate the power overhead analytically by considering static and dynamic power separately.

Generally, static power increases linearly with circuit area; hence, we estimate the static power overhead to increase linearly with the area overhead estimated in the previous section.

When it comes to dynamic power, we compare our proposed design with a register file twice as large. The rationale is that for some benchmarks, our design more than doubles the number of active thread blocks; to reach the same occupancy by increasing the register file size, it would also have to (more than) double. Our conclusion is that our design increases dynamic power less than a twice-as-large register file would. We base this conclusion on three observations.

First, the largest difference from the original pipeline is that the proposed design occasionally fetches two registers instead of one during a register read. Naturally, this behavior doubles the dynamic power of the register read when a double-fetch happens. However, how often this occurs is controlled by the compiler, since it decides whether an operand should (i) be split and placed in two different physical registers or (ii) be placed contiguously in one physical register. Hence, the compiler can be designed to be aware of the trade-off between minimizing fragmentation (i) and minimizing power dissipation (ii).

Second, in the worst case, the proposed register file organization increases the number of register fetches by 2x, which would lead to a doubling in power for each register read. However, note that this does not necessarily increase the power more than doubling the size of the register file would. Because of the banked register file organization, a doubling in capacity means a doubling in the number of entries in each bank, which means a doubling in bitline length. Since most of the dynamic power consumed in an SRAM is due to bitline charging [5], a doubling in bitline length also doubles the consumed dynamic power per register read.

Finally, as for the rest of the added structures, we estimate their contribution to the dynamic power overhead to be small: the dynamic power of the value converters, value truncators, and value extractors is negligible in comparison to the power consumption of the large register file, since their energy per operation is typically an order of magnitude below that of SRAM structures [19]. Furthermore, while the indirection tables are SRAM structures, they are also very small in comparison to the register file.
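The second argument can be made concrete with a two-line energy model. Assuming read energy scales with bitline length, and that a fraction f of reads are double-fetches, the proposed design costs (1 + f) times the baseline read energy, while a doubled register file costs roughly 2x on every read; the two only meet at f = 1. The scaling assumptions are ours, for illustration.

```python
def proposed_read_energy(double_fetch_fraction, e_base=1.0):
    """Average energy per read when a fraction of reads fetch two registers."""
    return (1.0 + double_fetch_fraction) * e_base


def doubled_rf_read_energy(e_base=1.0):
    """Doubled capacity -> doubled entries per bank -> doubled bitline
    length -> ~2x energy per read, if bitline charging dominates."""
    return 2.0 * e_base
```

Even a pessimistic 50% double-fetch rate (1.5x the baseline) stays below the doubled register file (2x the baseline).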
Our evaluation uses the NVIDIA Fermi architecture as baseline. However, it is important to understand the design implications for newer architectures. Here, we discuss how the design scales to the NVIDIA Volta [17] architecture.

Register shortage is a reality also in newer architectures. While the total register file size of Volta is much larger than that of Fermi (20,480 KB vs. 1,965 KB), the Volta architecture also supports more threads in total: each thread has at most 31 32-bit registers available at maximum occupancy. Keeping the register count low continues to be a problem for programmers, as the tuning has to be carried out manually [6]. This is a cumbersome task which requires the programmer to either rewrite the kernel or accept inefficient trade-offs, such as register spilling, to reach a high occupancy. Hence, register shortage remains a problem which can be alleviated by employing our approach.

When it comes to area, our estimate is that the overhead is slightly larger for Volta than for Fermi, but still very small (just over 2% of the total transistor budget). Note that this is a pessimistic estimate, since no optimizations have been considered, as further discussed below. The main reason for the increase is that Volta has a higher count of individual register files than Fermi. We derive this conclusion from the discussion below.

Recall from Section 3.1 that Fermi has two warp schedulers per SM. They share a register file, as well as all the computing units in the SM. In comparison, each SM in the Volta architecture is partitioned into four processing blocks, where each block has a dedicated warp scheduler, a register file of 64 KB, and its own computing units. Each register file requires its own operand collector, and thus dedicated indirection tables, value extractors, and value converters. Since the number of register banks scales with the maximum instruction throughput (two per SM for Fermi vs. one per processing block for Volta), and we need one value extractor per bank, we assume that Volta requires half of the value extractors needed for a Fermi register file, which corresponds to 400K transistors according to Section 6.4. Assuming all other structures are unchanged, we get an area overhead of 1.8M - 0.4M = 1.4M transistors per processing block, or 5.6M transistors per SM. The Volta architecture has 84 SMs, which sums up to a total area overhead of 470 million transistors. Although this is a higher count than for the Fermi architecture, Volta also has a significantly higher transistor count: 21 billion transistors in total. As a result, the area overhead is still a very small fraction of the total transistor budget (just over 2%). Note that this is a pessimistic estimate, since no circuit-level optimizations have been considered. In addition, while out of scope for this paper, it might be possible to share some of the structures between the processing blocks, which would further reduce the area overhead.

GPU register file optimizations have been addressed in several prior studies. Gilani et al. [7], Esfeden et al. [2], as well as Wang and Zhang [25] investigate optimizations based on narrow integers which are detected at run time, in stark contrast to our static approach, which works for all types of narrow data. Voitsechov et al. [23] employ narrow-integer packing based on static analysis, but they do not support floats. Angerd et al. [1] present a study on reducing the bitwidth of floating-point values, which is the method we adopt to tune the precision of floats offline.
Their method assumes an indirection table capable of handling floating-point operands of different bitwidths, but they present neither a microarchitecture design of the register file nor a complete register-file design capable of dealing with both integer and floating-point operands. We present a complete design, at the microarchitecture level, of such a register file.

Other related studies include Jeon et al. [8], who investigate releasing dead registers and re-allocating them to other warps. Yu et al. [27] propose a technique which increases the number of active warps by employing run-time allocation. Furthermore, Khorasani et al. [9] propose a software-hardware co-designed mechanism where some operands are statically allocated, while others time-share registers. Also, RegLess [10] uses a compiler-supported technique to allocate register file space only to currently accessed regions of code. All these techniques are orthogonal to ours, since they do not target reduction of register pressure.

Furthermore, Lee et al. [12] target register compression to enable more power-efficient GPUs. While this technique might lower the physical register usage per thread, their microarchitectural implementation specifically targets power consumption, while we target performance improvements.
Modern GPUs rely on TLP to provide high throughput. The thread register footprint limits TLP, since the state of all active threads must be readily available in the register file. In this paper, we propose a new concept for efficient register packing, which combines static integer and float operand compression with a novel GPU register file organization capable of lowering the register footprint by densely storing narrow operand values. We present a detailed microarchitectural implementation of the proposed organization, together with a performance evaluation and an overhead analysis. Our results show that the IPC of the investigated benchmarks can be increased by up to 79%, and by 18.6% on average, when allowing for a slight output quality degradation.
ACKNOWLEDGMENTS
This work is supported by the Swedish Research Council under contract numbers VR-2014-06221 and VR-2019-04929.
REFERENCES
[1] Alexandra Angerd, Erik Sintorn, and Per Stenström. 2017. A Framework for Automated and Controlled Floating-Point Accuracy Reduction in Graphics Applications on GPUs. ACM Trans. Archit. Code Optim. 14, 4, Article 46 (Dec. 2017), 25 pages. https://doi.org/10.1145/3151032
[2] Hodjat Asghari Esfeden, Farzad Khorasani, Hyeran Jeon, Daniel Wong, and Nael Abu-Ghazaleh. 2019. CORF: Coalescing Operand Register File for GPUs. In Proceedings of the Twenty-Fourth International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS '19). ACM, New York, NY, USA, 701–714. https://doi.org/10.1145/3297858.3304026
[3] A. Bakhoda, G. L. Yuan, W. W. L. Fung, H. Wong, and T. M. Aamodt. 2009. Analyzing CUDA workloads using a detailed GPU simulator. In 2009 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS). 163–174. https://doi.org/10.1109/ISPASS.2009.4919648
[4] S. Che, M. Boyer, J. Meng, D. Tarjan, J. W. Sheaffer, S. Lee, and K. Skadron. 2009. Rodinia: A benchmark suite for heterogeneous computing. In 2009 IEEE International Symposium on Workload Characterization (IISWC). 44–54. https://doi.org/10.1109/IISWC.2009.5306797
[5] Shin-Pao Cheng and Shi-Yu Huang. 2005. A low-power SRAM design using quiet-bitline architecture. In 2005 IEEE International Workshop on Memory Technology, Design, and Testing (MTDT). 135–139. https://doi.org/10.1109/MTDT.2005.10
[6] NVIDIA Corporation. 2019. CUDA C++ Best Practices Guide: Calculating Occupancy. Retrieved February 28, 2020 from https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html
[7] S. Gilani, N. S. Kim, and M. Schulte. 2013. Power-efficient computing for compute-intensive GPGPU applications. In 2013 IEEE 19th International Symposium on High Performance Computer Architecture (HPCA). 330–341. https://doi.org/10.1109/HPCA.2013.6522330
[8] Hyeran Jeon, Gokul Subramanian Ravi, Nam Sung Kim, and Murali Annavaram. 2015. GPU Register File Virtualization. In Proceedings of the 48th International Symposium on Microarchitecture (MICRO-48). ACM, New York, NY, USA, 420–432. https://doi.org/10.1145/2830772.2830784
[9] Farzad Khorasani, Hodjat Asghari Esfeden, Amin Farmahini-Farahani, Nuwan Jayasena, and Vivek Sarkar. 2018. RegMutex: Inter-Warp GPU Register Time-Sharing. In Proceedings of the 45th Annual International Symposium on Computer Architecture (ISCA '18). ACM.
[10] John Kloosterman, Jonathan Beaumont, D. Anoushe Jamshidi, Jonathan Bailey, Trevor Mudge, and Scott Mahlke. 2017. RegLess: Just-in-time Operand Staging for GPUs. In Proceedings of the 50th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO-50 '17). ACM, New York, NY, USA, 151–164. https://doi.org/10.1145/3123939.3123974
[11] Gunjae Koo, Yunho Oh, Won Woo Ro, and Murali Annavaram. 2017. Access Pattern-Aware Cache Management for Improving Data Utilization in GPU. In Proceedings of the 44th Annual International Symposium on Computer Architecture (ISCA '17). ACM, New York, NY, USA, 307–319. https://doi.org/10.1145/3079856.3080239
[12] Sangpil Lee, Keunsoo Kim, Gunjae Koo, Hyeran Jeon, Won Woo Ro, and Murali Annavaram. 2015. Warped-Compression: Enabling Power Efficient GPUs Through Register Compression. In Proceedings of the 42nd Annual International Symposium on Computer Architecture (ISCA '15). ACM, New York, NY, USA, 502–514. https://doi.org/10.1145/2749469.2750417
[13] Jingwen Leng, Tayler Hetherington, Ahmed ElTantawy, Syed Gilani, Nam Sung Kim, Tor M. Aamodt, and Vijay Janapa Reddi. 2013. GPUWattch: Enabling Energy Optimizations in GPGPUs. In Proceedings of the 40th Annual International Symposium on Computer Architecture (ISCA '13). ACM, New York, NY, USA, 487–498. https://doi.org/10.1145/2485922.2485964
[14] Sparsh Mittal. 2016. A Survey of Techniques for Approximate Computing. Comput. Surveys 48, 4 (March 2016). https://doi.org/10.1145/2893356
[15] S. Mittal. 2017. A Survey of Techniques for Architecting and Managing GPU Register File. IEEE Transactions on Parallel and Distributed Systems 28, 1 (Jan. 2017), 16–28. https://doi.org/10.1109/TPDS.2016.2546249
[16] NVIDIA. 2009. NVIDIA's Next Generation CUDA Compute Architecture: Fermi. White Paper.
[17] NVIDIA. 2017. NVIDIA Tesla V100 GPU Architecture. White Paper.
[18] NVIDIA. 2018. NVIDIA Turing GPU Architecture: Graphics Reinvented. White Paper.
[19] A. Pedram, S. Richardson, M. Horowitz, S. Galal, and S. Kvatinsky. 2017. Dark Memory and Accelerator-Rich System Optimization in the Dark Silicon Era. IEEE Design & Test 34, 2 (April 2017), 39–50. https://doi.org/10.1109/MDAT.2016.2573586
[20] I. Quilez and P. Jeremias. 2017. Shadertoy. https://www.shadertoy.com
[21] In Proceedings of the 2013 IEEE/ACM International Symposium on Code Generation and Optimization (CGO '13). IEEE Computer Society, Washington, DC, USA, 1–11. https://doi.org/10.1109/CGO.2013.6494996
[22] Vijay Sathish, Michael J. Schulte, and Nam Sung Kim. 2012. Lossless and Lossy Memory I/O Link Compression for Improving Performance of GPGPU Workloads. In Proceedings of the 21st International Conference on Parallel Architectures and Compilation Techniques (PACT '12). ACM, New York, NY, USA, 325–334. https://doi.org/10.1145/2370816.2370864
[23] Dani Voitsechov, Arslan Zulfiqar, Mark Stephenson, Mark Gebhart, and Stephen W. Keckler. 2018. Software-Directed Techniques for Improved GPU Register File Utilization. ACM Trans. Archit. Code Optim. 15, 3, Article 38 (Sept. 2018), 23 pages. https://doi.org/10.1145/3243905
[24] Vasily Volkov. 2016. Understanding Latency Hiding on GPUs. Ph.D. Dissertation. UC Berkeley.
[25] X. Wang and W. Zhang. 2017. GPU Register Packing: Dynamically Exploiting Narrow-Width Operands to Improve Performance. In 2017 IEEE Trustcom/BigDataSE/ICESS. 745–752. https://doi.org/10.1109/Trustcom/BigDataSE/ICESS.2017.308
[26] Zhou Wang, A. C. Bovik, H. R. Sheikh, and E. P. Simoncelli. 2004. Image quality assessment: from error visibility to structural similarity. IEEE Transactions on Image Processing 13, 4 (April 2004), 600–612. https://doi.org/10.1109/TIP.2003.819861
[27] Licheng Yu, Yulong Pei, Tianzhou Chen, and Minghui Wu. 2016. Architecture supported register stash for GPGPU. J. Parallel and Distrib. Comput. 89 (2016), 25–36. https://doi.org/10.1016/j.jpdc.2015.12.003