Sphynx: A Shared Instruction Cache Exploratory Study
Dong-hyeon Park, Akhil Bagaria, Fabiha Hannan, Eric Storm, Josef Spjut
Abstract
The Sphynx project was an exploratory study of what might be done to reduce the heavy replication of instructions across independent instruction caches in a massively parallel machine where a single program executes across all of the cores. While a machine with only a modest number of cores (fewer than 50) might have no trouble replicating the instructions for each core, as we approach the era where thousands of cores can be placed on one chip, the overhead of instruction replication may become unacceptably large. We believe that a large amount of sharing should be possible when the machine is configured so that all of the threads issue from the same set of instructions. We propose a technique that shares an instruction cache among a number of independent processor cores to allow inter-thread sharing and reuse of instruction memory. While we do not have test cases that demonstrate the potential magnitude of the performance gains that could be achieved, the potential for sharing reduces the die area required for instruction storage on chip.
1. Introduction
Instruction caches are widely used to mitigate the cost of reads from main memory relative to computation time. The instruction cache is accessed for every instruction executed, and program execution time can vary widely depending on the number of instruction cache misses [1]. In existing Graphics Processing Units (GPUs) and CPUs, each processor core has its own instruction cache. A unified or shared instruction cache used by many or all cores of a GPU or CPU has the potential to improve system performance and reduce power consumption. However, such a modification also increases traffic to the instruction cache, which could lead to a higher miss rate, reducing performance and increasing power consumption.

In current computer architectures, the instruction caches are independent of one another. However, since each processor uses the same set of instructions, it is plausible that a shared instruction cache could introduce non-trivial improvements in performance. Unified instruction caches could reduce the number of compulsory misses because an instruction previously executed by one streaming multiprocessor may be available to another streaming multiprocessor immediately rather than after an additional miss is serviced. In addition to a fully unified instruction cache used across all processor cores, another possible solution is to maintain multiple instruction caches, each shared across a subset of the cores. This architecture could allow the operating system to group program threads together based on similarity of instructions, maximizing the benefit of shared instructions and minimizing conflicts across threads.

Figure 1: Independent instruction caches
2. Background
An instruction cache can have a large impact on a processor's performance. There has been a lot of work in the past on improving the performance of instruction caches in CPUs. Techniques such as advanced branch prediction [12] and replacement policies [7] have contributed to the high performance of instruction caches in modern CPUs. It has even been shown that system performance can increase by 10-20% just by adjusting the operating system to use the instruction cache efficiently [11].

While all of these approaches are useful, the future of computing includes very large numbers of independent processor cores integrated on one chip. While general purpose CPUs are likely to include hundreds of cores at some point in the future, a good example to consider today is the GPU, which already exposes hundreds to thousands of threads to the programmer. Modern GPUs are designed with several compute units to maximize throughput, and these compute units are exposed for general purpose computation instead of only graphics applications. The compute units are compact and designed to perform simple operations on large sets of data. It is not feasible to implement techniques such as advanced branch prediction in these small compute units, because a single GPU can have hundreds of them. To improve the instruction cache performance of GPUs, new solutions need to be developed in ways that maintain the compact and simple nature of the compute units. Current GPU architectures are designed with an individual L1 instruction cache for each compute unit of the GPU [3]. It may be possible to have multiple compute units share the same instruction cache without incurring serious performance or power penalties.
3. Shared Instruction Cache Design
Most computer architectures build an independent instruction cache for each processing core, as seen in Figure 1. Reducing the number of instruction caches for the same number of cores in a multi-core processor may have certain advantages. First, a shared instruction cache design could reduce the amount of storage and die area required to hold instructions on chip when multiple processors are executing the same program. This advantage in storage and area could come at the cost of reduced performance if the pressure on the shared resource becomes too high. Similarly, the benefit of reduced cache size will not be realized in multi-program workloads unless the operating system is able to discover shared library code and perform the proper virtual mapping to allow the hardware to exploit it. Second, a shared instruction cache design could reduce the number of compulsory misses because an instruction previously fetched by one processor may be available to another processor immediately rather than requiring an additional cache miss. Again, this benefit is only relevant for single-application parallel workloads.

The first approach to shared instruction caches is to couple an instruction cache with a pair of processors, as in Figure 2. This would require a small amount of routing overhead and allow the operating system or programmer to intelligently schedule two threads that share instructions onto these two processor cores in order to exploit the shared instruction cache. While sharing an instruction cache between two cores is one step towards shared caches, at its limit a chip could be designed to share a single instruction cache among all processors on chip, shown with an example of only four processors in Figure 3. For an arbitrary number of cores, the level of sharing could be scaled up or down to fit the applications and purpose of the architecture design. One could even imagine a design where asymmetrical sharing is enabled by grouping different sets of compute cores differently and then assigning the application threads to the appropriate cores to match the sharing to the hardware available.

While our proposed technique is likely only beneficial in situations with highly parallel programs, it is probable that future systems will often run a few highly parallel workloads that can benefit from these advantages. A hybrid approach would allow some cores to efficiently process general purpose applications while cores with shared instruction caches execute parallel single-program workloads. When different threads within a parallel application diverge in program flow, it may be useful to further design the instruction caches with multiple independent banks that can be accessed in parallel. We believe the multi-banked approach to be the most beneficial; however, we do not present results or analysis of multi-banked caches in this work.

Figure 2: Partly shared instruction caches
Figure 3: Fully shared instruction caches
Figure 4: Hypothesis of area and miss rate for shared caches
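To make the proposed grouping concrete, the following minimal sketch models a set-associative instruction cache shared by a configurable group of cores and counts hits and misses as the cores' fetch streams are interleaved. The class and function names (SharedICache, run), the LRU policy, the round-robin interleaving, and the toy fetch trace are illustrative assumptions, not a description of any particular hardware implementation.

```python
# Minimal sketch of a set-associative instruction cache shared by a group of
# cores. Geometry, LRU policy, and the round-robin fetch order are assumptions
# made for illustration only; they do not describe a specific GPU design.

from collections import OrderedDict

class SharedICache:
    def __init__(self, num_sets=4, assoc=4, block_bytes=128):
        self.num_sets = num_sets
        self.assoc = assoc
        self.block_bytes = block_bytes
        # One LRU-ordered dict of resident tags per set.
        self.sets = [OrderedDict() for _ in range(num_sets)]
        self.hits = 0
        self.misses = 0

    def fetch(self, pc):
        block = pc // self.block_bytes
        index = block % self.num_sets
        tag = block // self.num_sets
        ways = self.sets[index]
        if tag in ways:
            ways.move_to_end(tag)          # refresh LRU position
            self.hits += 1
        else:
            self.misses += 1
            if len(ways) >= self.assoc:
                ways.popitem(last=False)   # evict the least recently used way
            ways[tag] = True

def run(per_core_traces, cores_per_cache):
    """Interleave instruction fetches from groups of cores onto shared caches."""
    caches = []
    for start in range(0, len(per_core_traces), cores_per_cache):
        cache = SharedICache()
        group = per_core_traces[start:start + cores_per_cache]
        for fetches in zip(*group):        # round-robin across the group's cores
            for pc in fetches:
                cache.fetch(pc)
        caches.append(cache)
    total = sum(c.hits + c.misses for c in caches)
    misses = sum(c.misses for c in caches)
    return misses / total if total else 0.0

# Example: 16 cores running the same small loop, grouped 1, 2, 4, ... per cache.
trace = [pc for _ in range(100) for pc in range(0, 2048, 8)]
for share in (1, 2, 4, 8, 16):
    print(share, "cores/cache -> miss rate", run([trace] * 16, share))
```

Because every core in this toy example runs the same loop, grouping more cores onto one cache amortizes the compulsory misses across the group, which is the effect the design aims to exploit.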
4. Results
There are two methods for analyzing the potential gains from shared instruction caches. The first is to keep the total instruction cache size across the chip constant and increase the capacity each thread sees by grouping that storage together. This method of sharing provides no area savings, but should grant an increased hit rate for the now larger cache. The increase will be limited by the working set of the application. We expect that the hit rate improvement will only be minor and therefore do not generate any results for this method.

The second method is to reduce the total storage required for instruction cache data arrays while allowing at most a minor degradation in hit rate. We consider an approach where the instruction cache each processor sees remains fixed but the number of threads sharing each cache is increased. We expect that the miss rate of the cache will increase when the instruction cache is not large enough to satisfy the demands from all the cores sharing the cache. However, the performance penalty should not be as dramatic as the reduction in area when all threads run the same application. Figure 4 shows our hypothesis for this approach, which should improve the efficiency of the chip. Note that the scale on the y-axis is omitted because the chart is included only to build intuition for the hypothesis and is not based on any simulation or otherwise realistic result.
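As a rough numerical illustration of the two methods, the sketch below tabulates, for each sharing factor, the cache capacity a group of cores would see under the first method and the total on-chip instruction cache storage under the second. The 16-core chip and 2 KB per-core cache are assumed values chosen to match the configuration studied later in this section; the arithmetic itself is the point.

```python
# Rough arithmetic for the two analysis methods described above, assuming a
# 16-core chip and a 2 KB instruction cache per core (illustrative values).

NUM_CORES = 16
PER_CORE_KB = 2

print("cores/cache | method 1: cache seen by group | method 2: total on-chip storage")
for cores_per_cache in (1, 2, 4, 8, 16):
    # Method 1: total storage stays constant, so each group's cache grows.
    method1_kb = PER_CORE_KB * cores_per_cache
    # Method 2: each cache stays at the per-core size, so fewer caches are needed.
    method2_kb = PER_CORE_KB * (NUM_CORES // cores_per_cache)
    print(f"{cores_per_cache:11d} | {method1_kb:27d} KB | {method2_kb:29d} KB")
```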
We use GPGPU-Sim [2] to analyze the effectiveness of our design because of its ability to simulate a large number of parallel processing cores. GPGPU-Sim is an open-source software package for simulating GPU architectures. It has been validated to be representative of performance on NVIDIA GPUs and provides a reasonable platform for testing alternate highly-parallel computer architectures. We use a reference configuration for an NVIDIA GTX580 GPU for our study. The GTX580 contains 16 streaming multiprocessors (SMs) with the NVIDIA Fermi architecture. In GPGPU-Sim, each CUDA streaming multiprocessor is represented as a single SIMT core, with all the SIMT cores placed within a single SIMT cluster. Each streaming multiprocessor can have up to 48 warps, with 32 threads per warp (see Table 1). All sixteen SIMT cores share a unified 768 KB L2 cache.
Table 1: GTX580 Configuration in GPGPU-Sim
SIMT cluster count: 1
SIMT cores per cluster: 16
Max warps per SIMT core: 48
Threads per warp: 32
L1 instruction cache: 4-way set associative, 4 sets, 128-byte blocks
Shared L2 cache: 768 KB, unified
SIMT cores per L1 instruction cache: 1, 2, 4, 8, 16 (varied)

GPGPU-Sim configurations used in simulation. Any parameter not listed in the table matches the original GTX580 configuration included in the public release of GPGPU-Sim.
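For a sense of scale, the short calculation below shows how many hardware threads can contend for a single instruction cache at each sharing factor, using the warp and thread counts from the configuration above; everything beyond those figures is simple arithmetic.

```python
# Hardware threads contending for one instruction cache at each sharing
# factor, using the GTX580 figures above (48 warps per SM, 32 threads per
# warp). The sharing factors are the ones swept in the simulations.

WARPS_PER_SM = 48
THREADS_PER_WARP = 32

for sms_per_cache in (1, 2, 4, 8, 16):
    threads = sms_per_cache * WARPS_PER_SM * THREADS_PER_WARP
    print(f"{sms_per_cache:2d} SMs per I-cache -> up to {threads:6d} threads share it")
# Even with a private cache per SM, 1,536 threads already share each cache;
# full sharing raises that to 24,576.
```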
In the Fermi architecture, each streaming multiprocessor has its own distinct L1 cache. Each L1 instruction cache is 4-way set associative, with 4 sets and 128-byte blocks, for a capacity of 4 × 4 × 128 bytes = 2 KB per cache. To assess the performance of different L1 instruction cache architectures, only the L1 instruction cache was modified, while all other architectural variables remained constant. Other aspects of the architecture, such as the bandwidth of the shared L2 cache, could serve as a performance bottleneck for different cache designs. However, these variables are ignored for this study, because we expect them to have little impact on the hit and stall rates of the instruction cache, which are the primary metrics of interest.
We used benchmarks provided by the standard GPGPU-Sim distribution to validate our second method for sharing instruction caches: keeping the cache the same size while increasing the number of threads that share it. This means varying the number of SMs per instruction cache from 1 to 16 (the maximum number of cores available on the GTX580). Note that since each SM can have up to 48 warps, the number of simultaneously executing threads sharing the same instruction cache is much larger than the SM count. The results can be seen in Figures 5 and 6. As expected, the miss rate increases slightly in most cases from the increased pressure on the shared cache. A couple of benchmarks were more problematic, resulting in almost a 45% stall rate when the cache is shared by 16 cores. We provide observations about those benchmarks below.

For the cases where the miss rate becomes unacceptable under increased sharing, we should allow the cache to increase in capacity. However, capacity alone will only affect the miss rate reported from the simulations. The stall rate represents the percentage of cache accesses that fail due to the increased pressure on the cache from sharing without increasing the available read parallelism. We expect that simply allowing independent banks of the instruction cache to be accessed in parallel would overcome this limitation, but we have not yet simulated this approach. However, other high-level simulation has shown this kind of instruction cache banking scheme to be potentially effective [6].

Figure 5: Simulated miss and stall rates for the configurations in Table 1
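The distinction between miss rate and stall rate can be illustrated with a toy contention model: assume each cache bank can service one fetch per cycle, so a fetch that loses arbitration stalls even if the line is resident. The sketch below is a deliberate simplification (stalled fetches are only counted, not retried), and the bank counts, address range, and function name are assumptions made for illustration rather than parameters of the simulated hardware.

```python
# Toy contention model: a shared instruction cache with a fixed number of
# independently accessible banks, each able to service one fetch per cycle.
# Requests that lose arbitration for their bank stall even if the line would
# hit, illustrating why the stall rate can exceed the miss rate under sharing.

import random

def stall_rate(num_cores, num_banks, block_bytes=128, cycles=10_000, seed=0):
    rng = random.Random(seed)
    stalls = 0
    requests = 0
    for _ in range(cycles):
        # Each core issues one fetch per cycle from a small hot region of code.
        fetches = [rng.randrange(0, 4096, 4) for _ in range(num_cores)]
        requests += num_cores
        # A fetch maps to a bank by block address; one fetch per bank wins.
        per_bank = {}
        for pc in fetches:
            bank = (pc // block_bytes) % num_banks
            per_bank[bank] = per_bank.get(bank, 0) + 1
        # Every request beyond the first to a bank stalls for this cycle.
        stalls += sum(count - 1 for count in per_bank.values())
    return stalls / requests

for banks in (1, 2, 4, 8):
    print(f"16 cores, {banks} bank(s): stall rate ~ {stall_rate(16, banks):.2f}")
```

Even in this crude model, adding independently accessible banks sharply reduces structural stalls, which is the intuition behind the multi-banked design suggested in Section 3.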
The result of varying the number of SMs sharing a cache from 1 to 16 is shown in Figure 5, and Figure 6 shows a zoomed-in version of the same data to show that even when the magnitude of misses is very small, the expected trend is followed. A total of seven benchmarks were tested, each available from the GPGPU-Sim source code [2].

The benchmark STO is mostly unaffected by the increase in the number of cores accessing the same small instruction cache, showing a similar level of miss and stall rate throughout. On the other hand, RAY exhibits a much higher stall rate than miss rate in Figure 5, and is evidently more sensitive to increased sharing of the cache. A stall rate higher than the miss rate occurs when the cache is not able to respond to the core in time, even when there is a cache hit. The rapid increase in stall rate in RAY and the relatively modest increase in miss rate suggest the cache is not able to handle the rapid increase in the number of requests coming from the cores, and has to stall even when there is a hit. As RAY has the largest number of PTX instructions amongst our set of benchmarks, the high miss and stall rates reflect the benchmark's heavy use of the instruction cache.

Similar relationships between stall rate and miss rate are also evident in the benchmarks BFS, CP, and LPS. However, the miss rate of less than 1% in these benchmarks suggests that they are affected mostly by initial compulsory misses and operate largely on a small set of instructions.

Figure 6: Simulated miss and stall rates (zoomed-in view of Figure 5)
5. Conclusions
The proposed instruction cache designs provide a possible solution for increasing the efficiency of the instruction cache in parallel processors, in particular processors whose primary workload consists of single programs with many parallel threads executing the same code. Sharing an instruction cache amongst multiple cores can help chip designers optimize for less area without compromising performance. The space recovered by sharing the instruction cache among multiple cores can be repurposed to improve other parts of the chip, such as larger data caches or additional computation units. We showed results for GPU workloads because they typically exhibit much higher levels of parallelism than CPUs while still executing a single application. GPUs traditionally only expose multi-program workload capabilities through coarse-grained time-sharing among processes.

For future work, the proposed instruction cache design should be simulated and tested further to verify its effectiveness. The cache architecture parameters expected to affect the performance of the design are the number of cores sharing the cache and the associativity and size of the cache. Extensive testing of these parameters should be conducted to determine the best instruction cache design for common workloads. The proposed design is expected to perform best on multi-threaded applications with a large amount of redundant instructions, so the drawbacks of running non-redundant code should also be considered.
Acknowledgement
We acknowledge that this idea for reducing the instruction cache area overhead by aggressively sharing the cache originally came from previous work on the TRaX architecture [9, 8, 6, 10, 4, 5]. The TRaX architecture was designed for ray tracing and supports thousands of threads running the same application but working on different data. We would also like to acknowledge the reviews from Konstantin Shkurko and Steven Jacobs. This work was completed at Harvey Mudd College in Claremont, California.
References

[1] R. Arnold et al., "Bounding worst-case instruction cache performance," in IEEE Real-Time Systems Symposium, 1994, pp. 172–181.
[2] A. Bakhoda et al., "Analyzing CUDA workloads using a detailed GPU simulator," in IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), 2009.
[3] S. W. Keckler et al., "GPUs and the future of parallel computing," IEEE Micro, vol. 31, no. 5, pp. 7–17, 2011.
[4] D. Kopta et al., "An energy and bandwidth efficient ray tracing architecture," in High-Performance Graphics (HPG 2013), 2013.
[5] D. Kopta et al., "Memory considerations for low energy ray tracing," in Computer Graphics Forum, 2014.
[6] D. Kopta et al., "Efficient MIMD architectures for high-performance ray tracing," in IEEE International Conference on Computer Design (ICCD), 2010.
[7] J. Smith and J. Goodman, "Instruction cache replacement policies and organizations," IEEE Transactions on Computers, vol. C-34, no. 3, pp. 234–241, March 1985.
[8] J. Spjut et al., "TRaX: A multicore hardware architecture for real-time ray tracing," IEEE Transactions on Computer-Aided Design, vol. 28, no. 12, pp. 1802–1815, 2009.
[9] J. Spjut et al., "TRaX: A multi-threaded architecture for real-time ray tracing," June 2008.
[10] J. Spjut et al., "A mobile accelerator architecture for ray tracing," 2012.
[11] J. Torrellas, C. Xia, and R. L. Daigle, "Optimizing the instruction cache performance of the operating system," IEEE Transactions on Computers, vol. 47, no. 12, pp. 1363–1381, 1998.
[12] T.-Y. Yeh, D. T. Marr, and Y. N. Patt, "Increasing the instruction fetch rate via multiple branch prediction and a branch address cache," in Proceedings of the 7th International Conference on Supercomputing. ACM, 1993, pp. 67–76.