A Survey of Novel Cache Hierarchy Designs for High Workloads
Pranjal Rajput * Delft University of Technology
Sonnya Dellarosa * Delft University of Technology
Kanya Satis * Delft University of Technology
Abstract—The traditional on-die, three-level cache hierarchy is very commonly used but is also prone to latency, especially at the Level 2 (L2) cache. We discuss three distinct ways of improving this design in order to obtain better performance, which is especially important for systems with high workloads. The first method proposes to eliminate L2 altogether while introducing a new prefetching technique, the second suggests increasing the size of L2, and the last advocates the implementation of optical caches. After carefully weighing the performance gains and the advantages and disadvantages of each method, we find the last method to be the best of the three.
I. INTRODUCTION
In the world of server computing, one of the main aspects that is constantly being improved is speed. While many things contribute to the speed of a server computer, the cache plays a fairly crucial role in increasing it. This is the reason why we chose cache hierarchy design as the topic of our assignment.

The main objective of this paper is to discuss three different approaches to cache hierarchy design that could improve the performance of a system with high workloads in comparison to a traditional three-level hierarchy design. The three levels are called Level 1 cache (L1), Level 2 cache (L2), and Last Level Cache (LLC). L1 has the least memory latency. Cache is also mostly on die, which means that, unlike main memory, it is usually fabricated on the same chip as the processor.

This survey looks into three different perspectives, offered by three different studies [1] [2] [3], in tackling this issue. One solution suggests a new algorithm for servicing cache requests, one examines different sizes of the physical architecture of the hardware, and one investigates an optical bus-based architecture.

After a careful examination of the three possible solutions by looking at their benefits, drawbacks, and, most importantly, their performance gains, we conclude that Solution 3 is the most optimal. The reason is that Solution 3 manages to considerably improve performance for high workloads while also addressing the problems that arise in the other two solutions.

In Section II, we provide a detailed description of each solution in separate sub-sections. In Section III, we compare the three solutions in terms of their performance gains and the benefits and consequences of each of them. We conclude the survey in Section IV.

* Denotes equal contribution

II. DESCRIPTION OF THE EVALUATED SOLUTIONS
The first approach, described in Section II-A, considers the criticality of instructions, where the most critical ones should be served as quickly as possible, i.e., by the L1 cache. We define the meaning of criticality, specify a particular way to describe it, and introduce a new prefetching technique that can be used to ensure that critical instructions are served by the L1 cache. The second approach, discussed in Section II-B, increases the L2 cache size so as to raise L2 hit rates and thus reduce the shared cache access latency. The authors also discuss the implementation of an exclusive hierarchy to stay in accordance with the design policies, and how this change increases performance as well. Lastly, the third approach, outlined in Section II-C, implements optical caches and shows how they are a significant improvement over conventional cache hierarchies.
A. Solution 1
A conventional cache hierarchy design typically implements an egalitarian approach, where all instructions are treated as equals and none has a higher priority than the others. This is actually not ideal: some instructions are more damaging to performance than others, because they cost more cycles. A lack of focus on this bottleneck-inducing issue forms the fundamental motivation for Criticality Aware Tiered Cache Hierarchy (CATCH), proposed by Nori et al. [1] as a possible solution.

The main goal of CATCH is to first identify the criticality of the program at the hardware level and then use new inter-cache prefetching techniques such that data accesses that are on a critical execution path are serviced by the innermost L1 cache, instead of the outer levels L2 or LLC.

Criticality in CATCH is calculated with the help of the data dependency graph (DDG) of the program. In this graph, first introduced by Fields et al. [4], each instruction has three nodes: one denotes the allocation in the core, another the dispatch to the execution units, and the last the write-back time. Links between nodes of two different instructions represent the dependencies between them. After an instruction completes, its nodes are created in the graph. Once at least twice the reorder buffer size worth of instructions has been buffered, the hardware can identify the critical path by finding the path with the largest weighted length. It then walks through this critical path and marks the critical load instruction addresses, otherwise known as Program Counters (PCs), that lie on it. For a processor with a reorder buffer size of 224, the total area required for this critical path identification in hardware is about 3 KB; the area calculation is discussed in detail in [1].

However, it is important to note that criticality is affected not only by the executed application but also by the hardware, since the hardware determines the execution latency of any instruction. The critical paths therefore change dynamically, and it is important to recalculate them throughout the execution of the application.
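To make the critical-path computation concrete, below is a minimal software sketch of the idea: each instruction contributes three DDG nodes, dependencies become weighted edges, and the critical path is the weighted-longest path through the graph. The graph encoding, node labels, and latency weights are our own illustrative assumptions, not the hardware structures of [1].

from collections import defaultdict

def longest_weighted_path(edges, source):
    # edges: {u: [(v, weight), ...]} describing a DAG.
    order, seen = [], set()
    def dfs(u):
        seen.add(u)
        for v, _ in edges.get(u, []):
            if v not in seen:
                dfs(v)
        order.append(u)
    dfs(source)
    order.reverse()                      # topological order from source
    dist = defaultdict(lambda: float("-inf"))
    parent = {}
    dist[source] = 0
    for u in order:
        for v, w in edges.get(u, []):
            if dist[u] + w > dist[v]:
                dist[v] = dist[u] + w    # relax: keep the heaviest path
                parent[v] = u
    return dist, parent

# Each instruction i contributes three nodes: (i, 'A') allocation,
# (i, 'D') dispatch to execution, (i, 'C') write-back/completion.
# Here, a long-latency load i0 feeds instruction i1.
g = {
    (0, 'A'): [((0, 'D'), 1), ((1, 'A'), 1)],   # in-order allocation edge
    (0, 'D'): [((0, 'C'), 40)],                 # assumed LLC-hit latency
    (0, 'C'): [((1, 'D'), 0)],                  # data dependency i0 -> i1
    (1, 'A'): [((1, 'D'), 1)],
    (1, 'D'): [((1, 'C'), 3)],
}
dist, parent = longest_weighted_path(g, (0, 'A'))

# Walk back from the last completion node, marking nodes on the path;
# load PCs whose nodes appear here would be flagged as critical.
node, critical = (1, 'C'), []
while node in parent:
    critical.append(node)
    node = parent[node]
print(critical[::-1])

In CATCH itself, this walk happens in hardware over a buffered window of completed instructions, and only the load PCs found on the path are marked critical.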
Now that the critical loads have been identified, they need to be prefetched into the L1 cache. As such, the cache lines that exist in the outer cache levels and correspond to critical load accesses need to be brought into L1 in a timely manner. L1 has a small capacity and limited bandwidth, so it is imperative to prefetch only the select set of critical loads that affect performance the most. Additionally, overfetching into the L1 cache must be avoided, as it can cause L1 thrashing, where new critical paths form and performance suffers. Thus, the authors propose Timeliness Aware and Criticality Triggered prefetchers (TACT), which prefetch data cache lines from L2 and LLC into the L1 cache just in time, before they are needed by the core.

Cache prefetching is a function of the Target-PC, which is the load PC that needs to be prefetched; the Trigger-PC, which is the load instruction that triggers the prefetching of the Target; and the Association, which is the relation between the attributes of the Trigger-PC and the address of the Target-PC. The goal of TACT is to learn this relation so as to allow timely prefetching for the Target-PC. To this end, TACT is split into three components: TACT Cross, TACT Deep Self, and TACT Feeder.

Cross associations of trigger addresses often appear when the Trigger-PC and Target-PC of a load instruction have the same base register but different offsets. TACT Cross is a hardware mechanism that spots these cross associations in order to exploit them. The authors first note that more than 85% of the delta values of cross address associations are observed to lie within a 4 KB page, which indicates a very high chance that the Trigger-PC accesses the same 4 KB page as the Target-PC. They then monitor the last 64 4 KB pages until the Target-PC has settled on one stable Trigger-PC. After this training, TACT Cross can issue a timely prefetch for the Target-PC whenever the Trigger-PC executes.
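The cross-association training can be sketched in software as follows. The table organization, the confirmation threshold, and the method names are invented for exposition; the real design is a hardware table tracking the last 64 4 KB pages.

PAGE = 4096

class TactCross:
    # Learn one stable (Trigger-PC, delta) pair per critical Target-PC.
    def __init__(self, threshold=8):
        self.last_addr = {}   # pc -> most recent load address
        self.trainer = {}     # target_pc -> [trigger_pc, delta, confirmations]
        self.learned = {}     # target_pc -> (trigger_pc, delta)
        self.threshold = threshold

    def on_load(self, pc, addr, critical_pcs):
        if pc in critical_pcs:           # train: pc acts as a Target-PC
            for trig, taddr in self.last_addr.items():
                if trig != pc and addr // PAGE == taddr // PAGE:
                    entry = self.trainer.get(pc)
                    if entry and entry[0] == trig and entry[1] == addr - taddr:
                        entry[2] += 1    # same association confirmed again
                        if entry[2] >= self.threshold:
                            self.learned[pc] = (trig, entry[1])
                    else:
                        self.trainer[pc] = [trig, addr - taddr, 1]
                    break
        self.last_addr[pc] = addr
        # If pc is a learned Trigger-PC, emit prefetch addresses now.
        return [addr + d for _, (trig, d) in self.learned.items() if trig == pc]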
TACT also employs a stride prefetching method, though the typically used prefetch distance is not punctual enough to hide all of the L2 and LLC hit latency. On the other hand, if the prefetch distance of all load PCs were increased, there would be too many prefetches, which may worsen the L1 latency and negatively affect performance. TACT Deep Self therefore increases the prefetch distance of only a select set of critical load PCs, in order to avoid hurting performance.
The last component, TACT Feeder, is needed when no address associations exist for the critical loads, in which case TACT attempts data associations. TACT Feeder tracks the dependencies between load instructions and fits a linear function of the form Address = Scale × Data + Base between the Target-PC address and the Trigger data, until stable values for the variables Scale and Base are identified. The data from the Trigger-PC can then prompt a prefetch for the Target-PC.
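A toy version of this data association is sketched below: it fits Address = Scale × Data + Base from successive (data, address) observations and only issues predictions once the same fit has been confirmed several times. The confirmation threshold and class layout are illustrative assumptions.

class TactFeeder:
    def __init__(self, confirmations=4):
        self.prev = None          # last (data, addr) observation
        self.fit = None           # candidate (scale, base)
        self.stable = 0
        self.needed = confirmations

    def observe(self, data, addr):
        if self.prev:
            d0, a0 = self.prev
            if data != d0:
                scale = (addr - a0) // (data - d0)
                base = addr - scale * data
                if self.fit == (scale, base):
                    self.stable += 1      # the same line fits again
                else:
                    self.fit, self.stable = (scale, base), 1
        self.prev = (data, addr)

    def predict(self, data):
        # Prefetch address for the Target-PC, once the fit is trusted.
        if self.fit and self.stable >= self.needed:
            scale, base = self.fit
            return scale * data + base
        return None

f = TactFeeder()
for d in [10, 11, 12, 13, 14, 15]:
    f.observe(d, 0x1000 + 8 * d)   # pointer-like pattern: addr = 8*d + 0x1000
print(hex(f.predict(16)))          # -> 0x1080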
The authors performed a preliminary investigation of the performance gain of this prefetching technique and found that, when enough load PCs are covered, configurations with and without an L2 yield similar performance gains. Supported by their recognition of the current trend of increasing the size of L2, whose benefit they identified as coming mainly from critical loads hitting the L2 cache, this leads to the idea of eliminating L2 from the cache hierarchy altogether. Thus, with CATCH, L1 serves the primary purpose of servicing the critical loads, while the LLC functions to reduce memory misses and prevent the formation of new critical paths. A two-level cache hierarchy would consequently reduce chip area redundancies and provide more room to enlarge the LLC or add new cores, although the authors note that it significantly increases the interconnect traffic. The total area necessary for all the TACT components is around 1.2 KB.

For evaluation, benchmarks of single-thread workloads are executed on a configuration with a large L2 (1 MB) and a shared, exclusive LLC of different sizes. The performance gain or loss is examined after the L2 is removed, with the workloads re-tested first without and then with CATCH, showing a performance gain of about 10% on a server. With CATCH, a higher number of workloads and a bigger LLC result in better performance gains. CATCH is also tested on multi-programmed workloads, where a three-level CATCH-implemented hierarchy improves performance by 8.95%, while a two-level CATCH-implemented hierarchy yields a gain of 8.45%.
B. Solution 2
Another solution to this problem is given by [2]. This paper proposes a change in the hierarchy, resizing the smaller caches so as to improve the average cache access latency. It uses server workloads as the application instance for which the improvements are made. This is an important application class that runs on multi-core servers, yet cache hierarchies are not necessarily designed with server applications in mind.

Modern processors are designed with a three-level cache hierarchy: small L1 and L2 caches for fast access latency and a large shared LLC to accommodate varying cache capacity demands. This type of hierarchy with a small private L2 cache is a good design for applications that fit into the L2. However, for bigger applications that are much larger than the L2, this design results in degraded performance, as most of the execution time is spent on on-chip interconnect latency. While this could be addressed by the design of a private or shared LLC and by prefetching techniques that hide L2 miss latency, the access patterns of server workloads are hard to predict and too complex for industrial use. Therefore, the paper proposes the direct solution of reducing the overhead of shared LLC latency by building large L2 caches. The performance improvement of this methodology is established through simulations of server workloads on a 16-core CMP.

By increasing the L2 cache size, more of the workload is serviced at the L2 hit latency instead of the shared cache access latency. The simulations showed that increasing the L2 size alone, without any change to the other properties, improved performance by about 5%, both in the presence and in the absence of prefetching techniques. A sensitivity study found that the improvement was predominantly due to servicing code requests at L2 hit latency, suggesting improved L2 cache management that preserves latency-critical code lines over data lines as a possible design change.

Though performance improves, this methodology has drawbacks that need to be corrected for a successful implementation. While increasing the L2 cache size improved the average cache access latency for the server workloads, the larger size can significantly hurt workloads whose working set fits into the smaller L2. Furthermore, any change in the size of the caches needs to conform to the design properties of the cache hierarchy. For an inclusive cache hierarchy, the L2:LLC area ratio needs to be maintained in order to support the inclusion property; ignoring this would waste cache capacity and create the negative effects of inclusion. On the other hand, increasing the LLC size to compensate is not a good option. Hence, the L2 could be enlarged either by stealing space from the LLC itself or by moving to an exclusive or non-inclusive hierarchy instead.

The paper proposes an exclusive hierarchy that meets the design requirements and the manufacturing constraints. The L2 size can be increased, while the constraint on total on-chip cache area is maintained, if inclusion is relaxed. It was seen that by moving to an exclusive hierarchy, the effective caching capacity is maintained while the average cache access latency is improved.
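As a back-of-the-envelope illustration of the latency argument above, the sketch below compares average memory access time (AMAT) for a small versus a large private L2. All hit rates and latencies here are invented for illustration and are not measurements from [2].

def amat(l1_hit, l2_hit, l1_lat=4, l2_lat=12, llc_lat=40, mem_lat=200):
    # Expected cycles per access for a three-level hierarchy; l2_hit is
    # the local L2 hit rate among L1 misses, and 20% of LLC accesses are
    # assumed to miss to memory.
    miss1 = 1 - l1_hit
    miss2 = 1 - l2_hit
    return l1_lat + miss1 * (l2_lat + miss2 * (llc_lat + 0.2 * mem_lat))

small_l2 = amat(l1_hit=0.90, l2_hit=0.30)   # small L2: few local hits
large_l2 = amat(l1_hit=0.90, l2_hit=0.60)   # large L2: more served locally
print(f"small L2: {small_l2:.1f} cycles, large L2: {large_l2:.1f} cycles")

With these assumed numbers, raising the local L2 hit rate from 30% to 60% cuts AMAT from 10.8 to 8.4 cycles, which is the effect the paper attributes mostly to code requests being served at L2 hit latency.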
This does come at a cost: increasing the L2 reduces the observed shared LLC capacity, which degrades the performance of bigger workloads that might have fit in the base LLC but not in the smaller one. Besides, in an exclusive hierarchy, where duplication is not allowed, longer access latencies may be incurred to service data. Despite these trade-offs, the results showed a significant performance increase for the commercial workloads.

To further improve the performance of the exclusive hierarchy, the authors propose adding a single bit per L2 cache line, called the Serviced From LLC (SFL) bit, for Re-Reference Interval Prediction (RRIP), to improve cache hit rates. An exclusive hierarchy invalidates LLC lines on cache hits, and this creates a new challenge for cache replacement policies that are meant to preserve lines that receive hits. In an exclusive LLC, re-referenced cache lines are not preserved, and using the policy proposed for the inclusive hierarchy does not improve performance either. Thus, the authors propose storing the re-reference information in the L2 cache using SFL bits, illustrated with the RRIP replacement policy (a minimal sketch of this insertion policy appears at the end of this subsection). It was seen that conventional RRIP actually reduced the hit rate by 25%, while using the SFL bit in the L2 cache replaces the RRIP functionality and considerably improves hit rates, by as much as 50%.

The authors also propose preserving latency-critical code lines in the L2 cache over data lines, using Code Line Preservation (CLIP). With CLIP, the majority of L1 instruction cache misses are serviced by the L2 cache. Even though this preserves the code lines in L2, if followed blindly it can degrade performance, especially when the working set does not contend for L2. Hence, CLIP dynamically decides the re-reference prediction of data requests using set sampling: it learns the code working set dynamically, and L2 capacity is allocated accordingly.

The results of these changes are studied in detail through simulation runs. Using the SFL bit in L2 caches to replace the RRIP functionality lost when migrating to an exclusive hierarchy restores a performance boost of about 30% from RRIP, with about 5% performance degradation. Simply applying CLIP to the baseline cache increases the performance of about half the server workloads by 5-10%, which is equivalent to doubling the L2 cache size and establishes the importance of code lines relative to data lines in the L2 cache. Workloads with the highest front-end misses per 1000 instructions (MPKI) show the largest performance increase when the L2 cache is enlarged along with CLIP. CLIP reduces front-end cache misses but increases L2 cache misses; still, the overall system performance increases significantly.

Increasing the L2 and shrinking the LLC can hurt workloads whose working sets fit in the larger shared LLC but not in the new one, with performance decreases as high as 30%. However, for higher workloads such as the server loads, the larger L2 cache always showed better performance. With these improvements in place, the paper concludes by showing a 5-12% performance increase for server workloads over the conventional hierarchy.
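Returning to the SFL bit, the following is a minimal sketch of how it could steer RRIP insertion across an exclusive L2/LLC pair: a line filled into L2 from the LLC sets its SFL bit, and on L2 eviction that bit substitutes for the hit information an inclusive LLC would have had. The RRPV width and the insertion values are illustrative assumptions, not the exact policy of [2].

RRPV_MAX = 3          # 2-bit RRIP: 0 = near re-reference, 3 = distant

class Line:
    def __init__(self, tag):
        self.tag = tag
        self.sfl = False   # Serviced From LLC bit, carried in the L2 line
        self.rrpv = RRPV_MAX

def fill_l2_from_llc(l2, line):
    line.sfl = True        # remember: this line made an LLC round-trip
    l2[line.tag] = line    # exclusive: the LLC copy is invalidated

def evict_l2_to_llc(llc, line):
    # On L2 eviction, the SFL bit plays the role a "hit" would have
    # played in an inclusive LLC: lines that already came back from the
    # LLC are predicted to re-reference soon and inserted with low RRPV.
    line.rrpv = 0 if line.sfl else RRPV_MAX - 1
    llc[line.tag] = line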
C. Solution 3
The third solution to the problem is the implementation of optical caches in chip multiprocessor (CMP) architectures. The optical caches are implemented on a separate chip rather than on the same CPU die. Spatial optical waveguides are used for the interconnection between the caches and the CPU and for the connection to the main memory, while Wavelength Division Multiplexing (WDM) optical interfaces are used for the cache interconnection systems.

The main idea behind implementing optical caches is to reduce the speed mismatch between the CPU and the main memory (MM). Conventional ways of adding levels to the cache hierarchy increase not only the total chip area and the energy consumption, but also the miss rates as the number of cores and the cache size grow.
Fig. 1. (a) Conventional architecture, (b) shared optical cache architecture [3].
P. Maniotis et al. [3] propose a solution to the above problem by presenting an optical cache memory architecture that uses optical CPU-MM buses, in place of standard electronic buses, for connecting all optical subsystems. In contrast to the standard approach of putting the L1 and L2 caches on die with the core to match their speeds, the optical caches are kept on a separate chip, as shown in fig. 1. A single-level, shared optical cache chip is used in the proposed architecture; it lies next to the CMP chip and consists of separate L1-instruction (L1i) and L1-data (L1d) caches. The CPU chip is thus free of caches, so the die area is reduced and can be used to add more cores.

Additionally, WDM interfaces are used at the edges of the cores and the main memory for proper connection between the subsystems over optical buses. The three optical buses consist of multiple communication layers, the number of which depends on their operations. The first bus is CPU-L1d, used by the CPU cores to access data from L1d; it consists of three discrete layers, namely address, data-write, and data-read. The second bus is CPU-L1i, which loads instructions from L1i into the CPU cores; it consists of only two layers, address and data-read. The final bus is L1-MM, which executes read-write operations between the cache chip and the MM.

WDM is used to send multiple bits over a single waveguide, in contrast to conventional electronic buses in which n bits are sent over n parallel wire lanes. In the proposed architecture, a pair of wavelengths is used per bit: one carries the actual bit and the other its complement. This means that transferring 8 bits of information requires 16 wavelengths per waveguide. An optical coupler is used for the communication between the cores and the waveguides; it propagates the data power along the bus, while a small amount of power is dropped at its attached core. For the optical waveguides, Silicon-on-Insulator (SOI) technology or single-mode polymer waveguides are used, as they exhibit low propagation losses of about 0.6 dB/cm and a density of 50 wires/cm. In the optical interfaces, filters and modulators based on ring modulators, operating at 10 Gbps, are used for opto-electronic and electro-optical conversion. The transmitter module has a ring modulator that modulates a CW signal and is tuned to a specific wavelength; for a 64-bit interface, a total of 8 transmitters is used, and each waveguide carries 8 bits of information. In the receiver module, add/drop ring resonator filters tuned to different wavelengths are used, and a photo-detector module converts any dropped wavelength into an electrical signal, which is then stored in a bit register.
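The dual-rail WDM encoding just described is easy to illustrate: each data bit occupies a pair of wavelengths, one carrying the bit and one its complement, so an 8-bit word always lights 8 of its 16 wavelengths. The wavelength indices below are arbitrary placeholders.

def encode_word(byte):
    # Return {wavelength_index: on/off} for one 8-bit word.
    channels = {}
    for bit in range(8):
        value = (byte >> bit) & 1
        channels[2 * bit] = value          # bit wavelength
        channels[2 * bit + 1] = 1 - value  # complement wavelength
    return channels

lit = encode_word(0xA5)
assert all(lit[2 * b] ^ lit[2 * b + 1] for b in range(8))  # rails always differ
print(sum(lit.values()), "of", len(lit), "wavelengths lit")  # always 8 of 16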
The optical caches consist of five subsystems: the Read/Write Selector (RWS), the Row Decoder (RD), the Way Selector (WS), the 2D Optical RAM Bank (2DRB), and the Tag Comparator (TC). The RWS handles read/write operations, permitting or prohibiting access to the 2DRB's contents. The RD activates 2DRB rows on the basis of the address line fields requested by the CPU. The WS is used only during write operations, to forward the tag bit signals and the incoming data to the proper cache way. The data and tag parts of the cache are incorporated in the 2DRB and are implemented with all-optical flip-flops, which are grouped under a common optical access gate in groups of 8 using an Arrayed Waveguide Grating (AWG) multiplexer. To determine cache hits and misses, the Tag Comparator is used: it consists of a set of all-optical XOR gates that compare the tag bits received from the CPU with the ones emitted from both ways of the RD-activated 2DRB row.

From simulation results, a high-quality final cache output is observed, with an average extinction ratio of 12 dB and error-free read and write operation at speeds up to 16 GHz. In addition, Photonic Crystal (PhC) nanocavity technology provides compact multiplexer modules, switches, and flip-flops that can be scaled to multi-bit storage devices, resulting in complete optical cache memory solutions of low energy consumption compared to electronic units. The CMP architecture with the off-chip cache module is compared against the conventional L1-L2 hierarchy and an L1-only (no L2 cache) hierarchy. In contrast to the current L1 caches in CMPs, the single-level optical L1 cache is shared among all the cores. The optical cache memory proved very fast in serving multiple requests arriving from different cores within a single electronic core cycle, without stalling execution. In shared electronic caches, by contrast, L2 can become a bottleneck under multiple L1 cache misses, with latencies on the order of tens of cycles. The flat optical cache hierarchy also avoids the data consistency problem inherent to an electronic multi-level design, where the same data cached in different units gets updated.

For the simulations, a CMP is used with the number of cores N varying from 2 to 16 in powers of 2, a core speed of 2 GHz, an MM of speed 1 GHz and size 512 MB, and an optical cache speed of 2×N GHz, while the Core-L1 and L1-MM optical waveguide buses run at the same clock as the optical caches. The simulations are performed on 6 benchmarks of the PARSEC suite, which comprises 13 unique multi-threaded programs covering diverse workloads from a variety of application domains. The chosen benchmarks combine medium and high data exchange properties with either low or high data sharing properties.
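At the bit level, the XOR-based comparison performed by the Tag Comparator described above reduces to the following check: a way hits exactly when the XOR of the requested tag with the stored tag is all zeros, with both ways compared in parallel. The tag width and the two-way organization are assumptions for illustration.

TAG_BITS = 20

def way_hit(request_tag, stored_tag):
    # All-zero XOR across the tag bits means every bit matched.
    return (request_tag ^ stored_tag) & ((1 << TAG_BITS) - 1) == 0

def lookup(request_tag, way0_tag, way1_tag):
    # Two-way lookup: both ways are compared in parallel, as in [3].
    hits = [way_hit(request_tag, way0_tag), way_hit(request_tag, way1_tag)]
    return hits.index(True) if any(hits) else None   # way index, or a miss

print(lookup(0xBEEF, 0xBEEF, 0x1234))  # -> 0 (hit in way 0)
print(lookup(0xCAFE, 0xBEEF, 0x1234))  # -> None (miss)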
Fig. 2. bodytrack's miss rates of the L1i and L1d caches for the Conv. L1, Conv. L1+L2, and Optical L1 architectures, for 8 and 16 cores [3].
The simulation results reveal that the miss rates of the conventional architectures are higher than those of the optical L1 architecture. Miss rates in the conventional architectures increase as the number of cores increases, while little variation is observed in the optical case. L1i and L1d cache miss rates grow with the conventional architecture, but the optical L1 architecture is largely unaffected, since the shared cache topology allows efficient inter-thread data sharing and exchange. With N = 8 and L1d sizes of 512 KB and 1024 KB, miss rates drop to as little as one-tenth of the conventional architecture's values, as can be seen in fig. 2. All the programs in the PARSEC suite show results similar to bodytrack, with L1d miss rates equal to a fraction of the conventional architecture's values. The optical L1 architecture also handles L1i better than the conventional architecture, even for small cache sizes. As far as execution cycles are concerned, optical caches provide better execution times than the conventional architecture for the same cache sizes, revealing the better performance of the former, and execution times drop significantly as the cache size is increased.

The paper thus concludes that the optical shared cache architecture provides much better performance than both conventional architectures. Not only does it support high data exchange and high parallelization, but it also provides ultra-fast caching and an off-chip cache that reduces the die area, while requiring a lower cache capacity. In certain cases, the miss rate is reduced by up to 96%, the average performance is increased by up to 20.52%, and the average cache capacity is reduced by up to 65.8%.

III. COMPARISON
Though the three solutions all discuss methods to increase the performance of high-workload applications, they use very different strategies to achieve the expected results. Solutions 1 and 2 use opposing strategies: one increases the L2 cache to decrease on-chip interconnect latency, while the other shrinks L2 to the point of eliminating it entirely, so that L1 serves most of the cache requests. Solution 3 agrees with Solution 1 in completely eliminating L2, but adds optical caches to improve the performance and speed of operation. Nevertheless, all three methodologies yield significant improvements when working with high workloads.

Figure 3 shows the performance improvement of Solution 1: completely eliminating L2 while implementing CATCH gives a performance boost of up to 13% for the server workloads. Similarly, Figure 4 shows the performance increase of Solution 2, where the performance for server loads increases by 5 to 15% with a 1 MB L2 and CLIP. The performance of Solution 3 can be seen in Figure 5, where the execution cycles, and consequently the performance, improve by about 20%.
Fig. 3. Performance graph for Solution 1 [1]
Even though all these models show a large boost in performance, each has its own advantages and disadvantages. Apart from the gain in performance, Solution 1 has the major advantage of a smaller area, which can be exploited to increase the number of cores on the processor; moreover, the CATCH framework requires only about 4 KB of hardware. The paper, unfortunately, does not provide a solution to the high interconnect latency, a problem that is well addressed in Solution 2. In stark contrast, Solution 2 expands the L2 cache to enhance performance and simultaneously reduce the interconnect latency. As a consequence, however, it comes with a disadvantage in terms of area consumption and thus allows fewer cores. Furthermore, this solution creates a negative performance profile for other application instances, whose much smaller workloads could just as well fit in the conventional L2 cache sizes, or for bigger workloads that fit in the conventional LLC space.

Fig. 4. Performance graph for Solution 2 [2]

Fig. 5. Performance graph for Solution 3 [3]

Solution 3 addresses both of these problems, area consumption and interconnect latency, by also removing the L2 cache, but combining this with the use of optical shared caches. As discussed earlier, removing the L2 cache decreases the on-die area, which creates space for more cores, while the use of optical caches provides ultra-fast caching. This solution maintains the performance of conventional architectures for smaller workloads and at the same time provides higher speed and better performance for bigger workloads.

IV. CONCLUSIONS
In this report, we have discussed three different cache hierarchy designs that could improve the performance of systems with high workloads. In Solution 1, we described a new inter-cache prefetching technique that processes instructions according to their criticality, while also arguing for the possible benefits of reducing or entirely removing the L2 cache; this solution improves performance by up to 13% for server workloads. In Solution 2, we discussed increasing the size of L2 instead, in order to improve the average cache access latency; this not only reduces the on-chip interconnect latency but also elevates the performance for server loads by 5 to 15%. In Solution 3, we considered the implementation of off-chip shared optical caches, which are connected to the CPU cores and the MM through optical waveguides. This results in better area utilization, which can then be used either to add more cores or simply to reduce the die area.

WDM optical interfaces, along with the spatially-multiplexed optical waveguides, connect the core-cache units and the MM-cache units, with the communication taking place in the optical domain. This architecture not only reduces the L1 miss rate significantly, by up to 96% in some cases, but also improves performance by 20.52% while reducing the average cache capacity by up to 65.8%. The architecture proved useful in the case of high parallelization and heavy server loads. Ultimately, we conclude that all the above-discussed solutions are able to significantly boost performance for heavy server workloads, although Solution 3 comes out as the best solution in terms of its benefits and drawbacks.
REFERENCES

[1] A. V. Nori, J. Gaur, S. Rai, S. Subramoney, and H. Wang, "Criticality Aware Tiered Cache Hierarchy: A Fundamental Relook at Multi-Level Cache Hierarchies," 2018 ACM/IEEE 45th Annual International Symposium on Computer Architecture (ISCA), Los Angeles, CA, 2018, pp. 96-109.

[2] A. Jaleel, J. Nuzman, A. Moga, S. C. Steely, and J. Emer, "High performing cache hierarchies for server workloads: Relaxing inclusion to capture the latency benefits of exclusive caches," 2015 IEEE 21st International Symposium on High Performance Computer Architecture (HPCA), Burlingame, CA, 2015, pp. 343-353.

[3] P. Maniotis, S. Gitzenis, L. Tassiulas, and N. Pleros, "High-Speed Optical Cache Memory as Single-Level Shared Cache in Chip-Multiprocessor Architectures," 2015 Workshop on Exploiting Silicon Photonics for Energy-Efficient High Performance Computing, Amsterdam, 2015, pp. 1-8.

[4] B. Fields, S. Rubin, and R. Bodik, "Focusing processor policies via critical-path prediction," Proceedings 28th Annual International Symposium on Computer Architecture (ISCA), Goteborg, Sweden, 2001, pp. 74-85.

[5] Y. Zheng, B. T. Davis, and M. Jordan, "Performance evaluation of exclusive cache hierarchies," IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), Austin, TX, USA, 2004, pp. 89-96.

[6] A. Jaleel, K. B. Theobald, S. C. Steely, Jr., and J. Emer, "High performance cache replacement using re-reference interval prediction (RRIP)," Proceedings of the 37th Annual International Symposium on Computer Architecture (ISCA), Saint-Malo, France, 2010, pp. 60-71.