Addressing Variability in Reuse Prediction for Last-Level Caches
Priyank Faldu
Doctor of Philosophy
Institute of Computing Systems Architecture
School of Informatics
University of Edinburgh
2019

Abstract

Last-Level Cache (LLC) represents the bulk of a modern processor's transistor budget and is essential for application performance, as the LLC enables fast access to data in contrast to much slower main memory. Problematically, technology constraints make it infeasible to scale LLC capacity to meet the ever-increasing working set size of the applications. Thus, future processors will rely on effective cache management mechanisms and policies to get more performance out of the scarce LLC capacity.

Applications with large working set size often exhibit streaming and/or thrashing access patterns at the LLC. As a result, a large fraction of the LLC capacity is occupied by dead blocks that will not be referenced again, leading to inefficient utilization of the LLC capacity. To improve cache efficiency, state-of-the-art cache management techniques employ prediction mechanisms that learn from past access patterns with an aim to accurately identify as many dead blocks as possible. Once identified, dead blocks are evicted from the LLC to make space for potentially high-reuse cache blocks.

In this thesis, we identify variability in the reuse behavior of cache blocks as the key limiting factor in maximizing cache efficiency for state-of-the-art predictive techniques. Variability in reuse prediction is inevitable due to numerous factors that are outside the control of the LLC. The sources of variability include control-flow variation, speculative execution and contention from cores sharing the cache, among others. Variability in reuse prediction challenges existing techniques in reliably identifying the end of a block's useful lifetime, thus causing lower prediction accuracy, coverage, or both. To address this challenge, this thesis aims to design robust cache management mechanisms and policies for the LLC in the face of variability in reuse prediction to minimize cache misses, while keeping the cost and complexity of the hardware implementation low. To that end, we propose two cache management techniques, one domain-agnostic and one domain-specialized, to improve cache efficiency by addressing variability in reuse prediction.

In the first part of the thesis, we consider domain-agnostic cache management, a conventional approach to cache management, in which the LLC is managed fully in hardware, and thus the cache management is transparent to the software. In this context, we propose Leeway, a novel domain-agnostic cache management technique. Leeway introduces a new metric, Live Distance, that captures the largest interval of temporal reuse for a cache block, providing a conservative estimate of a cache block's useful lifetime. Leeway implements a robust prediction mechanism that identifies dead blocks based on their past Live Distance values. Leeway monitors the change in Live Distance values at runtime and dynamically adapts its reuse-aware policies to maximize cache efficiency in the face of variability.

In the second part of the thesis, we identify applications for which existing domain-agnostic cache management techniques struggle in exploiting the high reuse due to variability arising from certain fundamental application characteristics. Specifically, applications from the domain of graph analytics inherently exhibit high reuse when processing natural graphs. However, the reuse pattern is highly irregular and dependent on graph topology; a small fraction of vertices, hot vertices, exhibit high reuse whereas a large fraction of vertices exhibit low or no reuse. Moreover, the hot vertices are sparsely distributed in the memory space. Data-dependent irregular access patterns, combined with the sparse distribution of hot vertices, make it difficult for existing domain-agnostic predictive techniques to reliably identify, and, in turn, retain hot vertices in cache, causing severe underutilization of the LLC capacity.

In this thesis, we observe that the software is aware of the application reuse characteristics, which, if passed on to the hardware efficiently, can help hardware in reliably identifying the most useful working set even amidst irregular access patterns. To that end, we propose a holistic approach of software-hardware co-design to effectively manage the LLC for the domain of graph analytics. Our software component implements a novel lightweight software technique, called Degree-Based Grouping (DBG), that applies a coarse-grain graph reordering to segregate hot vertices in a contiguous memory region to improve spatial locality. Meanwhile, our hardware component implements a novel domain-specialized cache management technique, called Graph Specialized Cache Management (GRASP). GRASP augments existing cache policies to maximize reuse of hot vertices by protecting them against cache thrashing, while maintaining sufficient flexibility to capture the reuse of other vertices as needed. To reliably identify hot vertices amidst irregular access patterns, GRASP leverages the DBG-enabled contiguity of hot vertices. Our domain-specialized cache management not only outperforms the state-of-the-art domain-agnostic predictive techniques, but also eliminates the need for any storage-intensive prediction mechanisms.

Lay Summary

Over the past few decades, technological advancements in the semiconductor industry have made the processors and the main memory significantly faster. However, the main memory has been getting faster at a much slower rate than the processors, widening the gap between the speed of the processor and the main memory. Consequently, slow access time of the main memory is one of the major performance bottlenecks in modern computer systems, as the processor needs to access data items (i.e., program instructions and data) from the main memory to perform computations.

To avoid accessing the main memory for every data item the processor needs, computer systems employ multiple caches between the processor and the main memory. A cache is a form of memory that is significantly faster (and closer to the processor) than the main memory, and thus retrieving a data item from the cache is much faster than retrieving it from the main memory. However, a cache is significantly more expensive (in dollars per byte) than the main memory. Consequently, caches tend to have considerably smaller capacity in comparison to the main memory, warranting judicious use of the precious cache capacity. To that end, the goal of a cache management technique is to decide which data items to store in the cache to minimize the number of accesses to the main memory.

For cache management, the Last-Level Cache (LLC) is of particular interest as it offers the largest capacity among all caches. Cache management for the LLC controls which data items are stored in the LLC. As an application executes and accesses more data items, cache management predicts which data items are more likely to be reused in the near future, and thus should be stored in the LLC. Meanwhile, when the cache is full, cache management also predicts which data items are unlikely to be reused in the near future, and thus can be removed. Naturally, the more accurate the reuse predictions, the better the cache efficiency.

State-of-the-art cache management techniques for the LLC observe cache access patterns of the data items over time and utilize this information to predict the future reuse of data items. In this thesis, we show that the LLC observes inconsistent access patterns for many data items due to numerous factors that are outside the control of the LLC. Thus, data items inevitably exhibit variability in the reuse behavior at the LLC, limiting existing techniques in making accurate predictions. In response, this thesis aims to design robust cache management mechanisms and policies for the LLC to minimize cache misses in the face of variability in reuse prediction, while keeping the cost and complexity of the hardware implementation low. To that end, we propose two new cache management techniques incorporating various variability-tolerant features.
Acknowledgements

It is impossible to get admitted to the PhD program of a world-class university, let alone graduate from it, without the constant help, support and guidance from family, friends and teachers. Naming all of them is not possible, but I sincerely thank each and every one of them from the bottom of my heart. Below, I specifically acknowledge a selected group of people without whom this thesis wouldn't have been possible.

First and foremost, my sincerest gratitude to Prof. Boris Grot, who has been truly a remarkable advisor throughout my PhD program. Boris has always been open for discussions and brainstorming, and has also given me the freedom to explore new problems on my own. Boris not only helped me improve my research skills, but also ensured my all-round development; whether it was encouraging me to mentor students in their projects, trusting me with the teaching and tutoring duties, nominating me for the organizing committee of ISCA, enabling me to network with the wider research community or even patiently helping me with my writing skills, his contributions have been enormous. His critical thinking, great attention to detail, and above all, his compassionate attitude towards his students make him the perfect advisor one could ask for. I am very grateful to Boris for all the guidance and support throughout my PhD, and also privileged to be his first PhD student.

The other most important person to whom I am indebted is my wife, Kruti, for her unconditional love and unwavering support. She is the one who encouraged me to pursue a PhD, even if that meant getting off of the driver's seat of her career. Words are not enough to describe her contributions as she took all the responsibilities upon herself to ensure I could focus on my research. Kruti has made several sacrifices for me to be able to complete my PhD, and for that she deserves an equal credit, if not more, for this thesis. She has been the source of encouragement during the tough times of paper rejections, and the perfect companion to celebrate every milestone on the way, little or big. I can safely say that Kruti has made me a better researcher, and more importantly a better person.

I thank Oracle for the internship opportunity, and my mentors there, Dr. Jeff Diamond and Dr. Avadh Patel, for making my internship an enriching experience. The work that I started during the internship, and expanded in the subsequent years, turned out to be a stepping stone for my thesis, spanning three out of four technical chapters.

I am fortunate to have had the opportunity to interact with and learn from Prof. Vijay Nagarajan, Prof. Björn Franke, Prof. Daniel Sorin, Prof. Babak Falsafi, Prof. Timothy Pinkston, Prof. Murali Annavaram, Prof. Daniel Jiménez, Prof. Rajeev Balasubramonian, my thesis examiners Prof. Michael O'Boyle and Dr. Gabriel Loh, and the anonymous reviewers from the Computer Architecture community. Learning from the very best of the field has been a privilege, and has made a far-reaching impact on me.

Special thanks to my friends in the School of Informatics and my academic siblings, Artemiy Margaritov and Amna Shahab, without whom the days would have passed far more slowly. They provided valuable feedback and suggestions to improve my ideas. Endless discussions, sometimes technical but more often not, provided a much needed break during those intense days before the submissions. We have been through each other's ups and downs together.

I thank the faculty members, support staff and students of the School of Informatics for their help and support. I would like to specially thank Antonios Katsarakis, Arpit Joshi, Cheng-Chieh Huang, Dmitrii Ustiugov, Rakesh Kumar, Saumay Dublish, Siavash Katebzadeh and Vasilis Gavrielatos for their valuable help and support, both technical and otherwise.

My time in Edinburgh has been made special with the friendship of Supriya and Sidharth Kashyap. I thank them for their company and for providing such a rich source of conversation, education and entertainment. They have been the family away from home.

Finally, last but not the least, I would like to thank my parents, Popatlal and Jyotsana, and sisters, Urvi and Ronak, for their endless love. I would not be who I am today without their enormous support and sacrifices throughout my life.

As the submission of this thesis turns a new chapter in my life, I thank God for the perfectly timed wonderful gift in the form of my little son, Mivaan. With him on my side, I look forward to embark upon a new journey . . .

Dedicated to my wife, Kruti.

Chapter 1
Introduction
The microprocessor industry has enjoyed four decades of exponentially growing transistor budgets, enabling complex core microarchitectures, multi-core processors, and cache capacities reaching into tens of megabytes (MB) for commodity processors. The looming reality, however, is that Moore's law is nearing its limits both in terms of physics and economics. Combined with the end of voltage scaling, the semiconductor industry is entering a new phase where transistors become a limited resource and a new technology generation cannot be counted on to double them. This calls for a new regime in computer systems, one in which every transistor counts.

Last-Level Cache (LLC) represents the bulk of a modern processor's transistor budget and is an essential feature of modern processors. Fig. 1.1 shows die photos of two modern processors in which the LLC (labeled L3) occupies nearly the same area as the processor cores. The LLC has been instrumental in bridging the gap in the speed of processor and memory via ever-larger capacities, providing performance gains across processor generations. In the future, however, further increases in cache capacity may become a difficult proposition due to technology constraints. Thus, future processors will rely on effective cache management mechanisms and policies to get more performance out of the scarce LLC capacity and minimize long-latency memory accesses.
Applications with large working set size often exhibit thrashing and/or streaming access patterns at the LLC, leading to premature evictions of useful cache blocks that are likely to be referenced in the near future. Meanwhile, a large fraction of the LLC capacity is occupied by dead blocks that will eventually be evicted without incurring further hits, leading to inefficient utilization of the LLC capacity. Cache efficiency can be improved significantly by identifying dead blocks and discarding them immediately after their last use, thereby providing an opportunity for cache blocks with long temporal reuse distances to persist in the cache longer and accumulate more hits.

Figure 1.1: Die photos of modern processors highlighting the floor area devoted to different components. (a) Intel Broadwell-E Core i7-6950X featuring 10 cores and 25 MB shared L3 (2.5 MB L3 slice per core) [35]. (b) AMD Zen microarchitecture Core Complex (CCX) featuring 4 cores and 8 MB shared L3 (2 MB L3 slice per core) [11].

State-of-the-art cache management techniques employ prediction mechanisms that learn from past access patterns with an aim to correctly identify as many dead blocks as possible. Effectiveness of these predictors hinges on the stability of application behavior with respect to the metric used for determining whether the block is dead. Naturally, the more consistent the reuse behavior across the block's lifetimes (also called generations) in the cache, the more accurate the predictions.

In practice, applications exhibit variability in the reuse behavior of cache blocks. The sources of variability are numerous, such as microarchitectural noise (e.g., speculation), control-flow variation, cache pressure from other threads and inherent application characteristics. These sources of variability are outside the control of the LLC, making variability in the reuse behavior an inevitable challenge for a cache management technique. Variability in reuse prediction challenges existing techniques in reliably identifying the end of a block's useful lifetime, thus causing lower prediction accuracy, coverage or both. A wrong prediction may either cause premature eviction of a useful cache block, leading to an additional cache miss, or cause delay in eviction of a dead block, leading to wastage of cache capacity. This calls for cache management mechanisms and policies that can tolerate variability in the reuse behavior of cache blocks to maximize cache efficiency. The aim of this thesis is:
To design robust cache management mechanisms and policies for LLC that minimize cache misses in the face of variability in the reuse behavior of cache blocks, while keeping the cost and complexity of the hardware implementation low.

To that end, we propose two cache management techniques, one domain-agnostic and one domain-specialized, that introduce robust mechanisms and policies to address variability in reuse prediction. The rest of the chapter provides a brief overview of both proposals.
In this part of the thesis, we consider a conventional approach to cache management, namely domain-agnostic cache management, in which the LLC is managed completely in hardware. Such an approach is quite attractive in practice as the cache management remains fully transparent to the application software. There has been a rich history of works that proposed various domain-agnostic techniques to improve cache efficiency [8, 18, 37, 39, 40, 54, 59, 63, 67, 69, 71, 73, 76, 78, 80, 81, 82, 85, 86, 87, 88, 89, 97, 103, 110].

The state-of-the-art techniques employ prediction mechanisms that seek to correctly identify as many dead blocks as possible and evict them immediately after their last use to reduce cache thrashing. These predictors all rely on some metric of temporal reuse to make their decisions regarding the end of a given block's useful life. Previous works have suggested hit count [81], last-touch PC [73], and the number of references to the block's set since the last reference [59], among others, as metrics for determining whether the block is dead at a given point in time. However, we observe that existing metrics limit the accurate identification of dead blocks in the face of variability. For example, when the number of hits to a cache block is inconsistent across generations, a technique relying on this metric (i.e., hit count) would either prematurely classify the cache block dead or may not classify the cache block dead at all until its eviction, both of which lead to cache inefficiency. This calls for robust metrics and policies that can tolerate inconsistencies.

To that end, we propose Live Distance, a new metric of temporal reuse based on stack distance; the stack distance of a given access to a cache block is defined as the number of unique cache blocks accessed since the previous access to the cache block [112]. For a given generation of a cache block (from allocation to eviction), live distance is defined as the largest observed stack distance in the generation. Live distance provides a conservative estimate of a cache block's useful lifetime.

We introduce Leeway, a new domain-agnostic cache management technique that uses live distance as a metric for dead block prediction. Leeway uses code-data correlation to associate a live distance for a group of blocks with the PC that brings the block into the cache. While live distance as a metric provides a high degree of resilience to variability, the per-PC live distance values themselves may fluctuate across generations. To correctly train live distance values in the face of fluctuation, we observe that an individual application's cache behavior tends to fall into one of two categories: streaming (most allocated blocks see no hits) and reuse (most allocated blocks see one or more hits). Based on this simple insight, we design a pair of corresponding policies that steer updates in live distance values either toward zero (for bypassing) or toward the maximum recently-observed value (to maximize reuse). For each application, Leeway picks the best policy dynamically based on the observed cache reuse behavior.

To avoid the need to access specialized external structures (e.g., prediction tables) upon each LLC access, Leeway embeds its prediction metadata (i.e., Live Distance) directly with cache blocks. This is in contrast with prior predictors [37, 39, 40, 73], which need to access a dedicated predictor table upon every single LLC access. Because modern multi-core processors feature a distributed LLC, accesses to dedicated prediction tables introduce detrimental latency and energy overheads in traversing the on-chip interconnect to query such structures.
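To make the metric concrete, the following is a minimal, illustrative sketch of how live distance could be derived from a trace of block addresses mapping to one cache set. The trace, the single-set model and all identifiers are hypothetical; this is not Leeway's actual hardware mechanism, which tracks live distance per generation and aggregates it per inserting PC.

```cpp
#include <algorithm>
#include <cstdint>
#include <iostream>
#include <unordered_map>
#include <vector>

// Illustrative sketch only: derive per-block live distance for a single cache
// set from a trace of block addresses. The stack distance of an access is the
// number of unique blocks touched since the previous access to the same block;
// the live distance is the largest stack distance observed in the block's
// current generation (evictions are not modeled in this toy example).
int main() {
    std::vector<uint64_t> trace = {1, 2, 3, 1, 4, 2, 1};    // block addresses mapping to one set

    std::vector<uint64_t> lru_stack;                        // most recently used block at index 0
    std::unordered_map<uint64_t, int> live_distance;        // block -> largest stack distance so far

    for (uint64_t blk : trace) {
        auto pos = std::find(lru_stack.begin(), lru_stack.end(), blk);
        if (pos != lru_stack.end()) {                       // re-reference: depth in the stack is the stack distance
            int depth = static_cast<int>(pos - lru_stack.begin());
            live_distance[blk] = std::max(live_distance[blk], depth);
            lru_stack.erase(pos);
        } else {
            live_distance.emplace(blk, 0);                  // new generation starts at live distance 0
        }
        lru_stack.insert(lru_stack.begin(), blk);           // promote (or insert) at the MRU position
    }

    for (const auto& [blk, ld] : live_distance)
        std::cout << "block " << blk << ": live distance " << ld << "\n";
}
```

For this toy trace, block 2 ends with a live distance of 3 because three distinct blocks (3, 1 and 4) are touched between its two accesses; Leeway uses such per-generation values, grouped by the inserting PC, to estimate how long a block remains worth keeping.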
In this part of the thesis, we identify applications for which existing domain-agnosticcache management techniques struggle in exploiting the high reuse due to variabilityarising from certain fundamental application characteristics. Specifically, we explore .2. Our Proposals applications from the domain of graph analytics processing natural graphs. Fornatural graphs, the vertex degrees follow a skewed power-law distribution, in whicha small fraction of vertices have many connections while the majority of verticeshave relatively few connections [6, 28, 61, 105, 106]. Such graphs are prevalent in avariety of domains, including social networks, computer networks, financial networks,semantic networks, and airline networks.The power-law skew in the degree distribution means that a small set of verticeswith the largest number of connections is responsible for a major share of off-chipmemory accesses. The fact that these richly-connected vertices, hot vertices , comprisea small fraction of the overall footprint while exhibiting high reuse makes them primecandidates for caching. Meanwhile, the rest of the vertices, cold vertices , comprise alarge fraction of the overall footprint while exhibiting low or no reuse.Despite the high reuse inherent in accesses to the hot vertices, graph applicationsexhibit poor cache efficiency due to the following two reasons: hot vertices are sparsely distributed throughout thememory space, exhibiting a lack of spatial locality. When hot vertices share the cacheblock with cold vertices, valuable cache space is underutilized. hot vertices inherently exhibit hightemporal reuse. However, the reuse patterns of graph-analytic applications is highlyirregular and is dependent on graph topology, which cause severe cache thrashingwhen processing large graphs. Accesses to a large number of cold vertices areresponsible for thrashing, often forcing hot vertices out of the cache.Both problems are orthogonal in nature as solving one problem does not solve theother. Overcoming the former problem requires improving cache block utilizationby focusing on intra-block reuse, whereas the latter problem requires retaining highreuse cache blocks in the LLC by focusing on inter-block reuse.The former problem is outside the scope of any cache management techniqueas it stems from the fact that vertex properties usually require just 4 to 16 bytes incomparison to 64 or 128 bytes of a cache block size in modern processors. Thus, theeffective spatial locality is completely dictated by the vertex layout in memory for agiven graph dataset, which is in complete control of the software.The latter problem is what a cache management technique targets. However, longreuse distances along with irregular access patterns impede learning mechanisms ofthe state-of-the-art domain-agnostic cache management techniques, rendering them Chapter 1. Introduction deficient for the entire application domain.We observe that the software not only has the knowledge of crucial applicationsemantics such as vertex degrees, but also controls the placement of vertices inmemory. Thus, cache management for graph analytics can be significantly improvedby leveraging software support.To that end, we propose a holistic approach of software-hardware co-designto improve cache efficiency for the domain of graph analytics processing naturalgraphs. 
Our software component implements a novel lightweight software technique,called Degree-Based Grouping (DBG), that applies a coarse-grain graph reordering tosegregate hot vertices in a contiguous memory region to improve spatial locality.Our hardware component implements Graph Specialized Cache Management(GRASP). GRASP augments existing cache insertion and hit-promotion policies toprovide preferential treatment to cache blocks containing hot vertices to shield themfrom thrashing. To cater to the variability in the reuse behavior, GRASP policies aredesigned to be flexible to cache other blocks exhibiting reuse, if needed.GRASP relies on lightweight software support to accurately pinpoint hot verticesamidst irregular access patterns, in contrast to the state-of-the-art domain-agnostictechniques that rely on storage-intensive prediction mechanisms. By leveragingcontiguity among hot vertices (enabled by DBG), GRASP employs a lightweightsoftware-hardware interface comprising of only a few configurable registers, whichare programmed by software using its knowledge of the graph data structure.The strength and novelty of our co-design lies in the interplay between software(DBG) and hardware (GRASP). Software aids hardware in pinpointing hot vertices viaa lightweight interface, thus eliminating the need for storage-intensive cache metadatarequired by the state-of-the-art domain-agnostic techniques. Meanwhile, hardware isresponsible for exploiting temporal locality in presence of cache thrashing, allowingsoftware to focus only on inducing spatial locality, enabling low-overhead softwarereordering compared to high-overhead complex software-only vertex reorderingtechniques that target both spatial and temporal locality. A holistic software-hardwareco-design enables high cache efficiency for graph analytics while keeping both softwareand hardware components simple.
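As a concrete illustration of the software side, the sketch below shows a coarse-grain, degree-based grouping of vertex IDs. The two-way hot/cold split and the average-degree threshold are illustrative simplifications; DBG's actual grouping scheme is described in Chapter 5.

```cpp
#include <cstdint>
#include <iostream>
#include <numeric>
#include <vector>

// Illustrative sketch only: a coarse-grain, degree-based grouping of vertices.
// Vertices whose degree exceeds a threshold (here, the average degree) receive
// new, contiguous IDs ahead of the remaining vertices, while the original
// relative order *within* each group is preserved. The real DBG scheme
// (Chapter 5) uses several degree-based groups; a two-way split keeps the
// sketch short.
std::vector<uint32_t> degree_based_grouping(const std::vector<uint32_t>& degree) {
    const uint64_t total = std::accumulate(degree.begin(), degree.end(), uint64_t{0});
    const double avg = static_cast<double>(total) / degree.size();

    std::vector<uint32_t> new_id(degree.size());
    uint32_t next = 0;
    for (uint32_t v = 0; v < degree.size(); ++v)           // hot vertices first, order preserved
        if (degree[v] > avg) new_id[v] = next++;
    for (uint32_t v = 0; v < degree.size(); ++v)           // then cold vertices, order preserved
        if (degree[v] <= avg) new_id[v] = next++;
    return new_id;                                          // new_id[v] = position of v in the reordered layout
}

int main() {
    std::vector<uint32_t> degree = {1, 50, 2, 40, 1, 3};    // skewed degrees: vertices 1 and 3 are "hot"
    for (uint32_t id : degree_based_grouping(degree)) std::cout << id << " ";
    std::cout << "\n";                                       // prints: 2 0 3 1 4 5
}
```

Because the hot vertices end up in one contiguous ID (and hence address) range, the hardware side only needs a few software-programmed bound registers to classify an incoming cache block as hot or cold, which is exactly the kind of lightweight interface GRASP relies on.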
Some of the contents of this thesis have appeared in the following publications:

The publications appearing in Chapter 3:

• P. Faldu and B. Grot. "LLC Dead Block Prediction Considered Not Useful". In International Workshop on Duplicating, Deconstructing and Debunking (WDDD), co-located with ISCA. 2016. [32]

• P. Faldu and B. Grot. "Reuse-Aware Management for Last-Level Caches". In International Workshop on Cache Replacement Championship (CRC), co-located with ISCA. 2017. [14]

• P. Faldu and B. Grot. "Leeway: Addressing Variability in Dead-Block Prediction for Last-Level Caches". In Proceedings of the International Conference on Parallel Architectures and Compilation Techniques (PACT). 2017. [15]

The publication appearing in Chapter 5:

• P. Faldu, J. Diamond and B. Grot. "A Closer Look at Lightweight Graph Reordering". In Proceedings of the International Symposium on Workload Characterization (IISWC). 2019. [3]

The publications appearing in Chapter 6:

• P. Faldu, J. Diamond and A. Patel. "Cache Memory Architecture and Policies for Accelerating Graph Algorithms". U.S. Patent 10417134. Oracle International Corporation. 2019. [5]

• P. Faldu, J. Diamond and B. Grot. "POSTER: Domain-Specialized Cache Management for Graph Analytics". In Proceedings of the International Conference on Parallel Architectures and Compilation Techniques (PACT). 2019. [4]

• P. Faldu, J. Diamond and B. Grot. "Domain-Specialized Cache Management for Graph Analytics". In Proceedings of the International Symposium on High-Performance Computer Architecture (HPCA). 2020. [1]
The rest of the thesis is organized as follows. Chapter 2 presents the necessary background on cache management techniques to understand the limitations of the state-of-the-art techniques. Chapter 3 presents the design and evaluation of Leeway, our domain-agnostic cache management technique.

Chapter 4 highlights the limitations of domain-agnostic cache management techniques for the domain of graph analytics and motivates the need for a software-hardware co-design to manage the LLC for graph analytics. The next two chapters present the software and hardware components of the proposed co-design: Chapter 5 presents DBG, a new software vertex reordering technique to improve spatial locality, and Chapter 6 presents GRASP, a domain-specialized cache management technique that leverages DBG to further improve cache efficiency for graph analytics. Finally, we conclude our proposals in Chapter 7 and provide potential future directions of research for cache management.

Chapter 2
Background
In typical desktop and server computers, the memory hierarchy is organized as several levels of memories of different speeds and sizes. Each level of memory is bigger and cheaper per byte, but slower, than the previous higher level that is closer to the processor. Fig. 2.1 shows a three-level cache hierarchy along with the adjacent levels, including their typical access times and sizes. Fig. 2.2 shows a typical layout of a cache hierarchy in a modern multi-core processor. L1 and L2 caches are private per core, whereas L3, also called the Last-Level Cache (LLC), is shared across cores. While for the purpose of caching L3 can be logically seen as a single structure, physically, L3 is organized as multiple Non-Uniform Cache Access (NUCA) slices [98], as shown in the figure.
Figure 2.1: A typical memory hierarchy containing three levels of caches, including typical access times (on left) and typical sizes (on right), ranging from ~1000s of bytes for registers (accessed in under 1 ns) through ~10s of KB (L1), ~100s of KB (L2) and ~10s of MB (L3/LLC) up to ~100s of GB for main memory [16].

Figure 2.2: A typical layout of a modern multi-core processor with three levels of the cache hierarchy: each core has private L1 (instruction and data) and L2 caches, while the L3 (LLC) is physically distributed into per-core slices and backed by DRAM.
A cache hierarchy can be maintained as fully-inclusive, fully-exclusive or non-inclusive non-exclusive. A fully-inclusive level of cache must contain all the cache blocks that are present in the previous higher-level cache. Conversely, a fully-exclusive level of cache must not contain any cache block that is present in the previous higher-level cache. Finally, a non-inclusive non-exclusive level of cache does not observe any such constraints, and it may or may not contain the cache blocks that are present in the previous higher-level cache. Meanwhile, the main memory is inclusive of all the cache levels, meaning memory stores all addresses regardless of whether they are present in any of the cache levels.

During execution, a CPU core first queries the L1 cache for the data item (i.e., a program instruction or an application data item) needed to perform computations. If the data item is found (i.e., a cache hit), L1 responds to the request with the necessary data. Meanwhile, if the data item is not found (i.e., a cache miss), the next lower-level cache is queried. The process is repeated until the data item is found in one of the caches. If the data item is not found in any cache, it is retrieved from the main memory.

The Last-Level Cache (LLC) (i.e., L3 for a three-level cache hierarchy or L2 for a two-level cache hierarchy) is of particular interest as it acts as an on-chip frontier, a miss in which requires a long-latency memory access. The LLC offers the largest capacity among all the on-chip caches, and thus can store the largest fraction of the working set of an application. However, the LLC capacity offered by modern processors is significantly smaller than the working set size of emerging applications. Fortunately, most applications do not access all data items uniformly, meaning some data items are likely to be reused more frequently than others, due to an application property known as locality, as discussed below.

Caches are designed to exploit the principle of locality observed in most applications. Two different types of locality have been observed. Spatial locality refers to locality in space, which states that data items whose addresses are near one another tend to be referenced close together in time. Temporal locality refers to locality in time, which states that recently accessed data items are likely to be accessed in the near future.

To exploit spatial locality, caches operate at the granularity of a unit called a cache block (or cache line), which consists of several bytes (typically, 64 or 128 bytes). When moving data between caches, an entire cache block containing the data item is transferred, in anticipation that the other nearby data items will be accessed soon due to spatial locality.

Temporal locality is exploited by caching the most recently accessed cache blocks. A widely popular cache management technique that achieves this is called Least Recently Used (LRU). LRU maintains the cache blocks in a cache set as a recency stack. Cache blocks are ordered in the stack based on how recently they were accessed, with the Most Recently Used (MRU) cache block at the top of the stack and the Least Recently Used (LRU) cache block at the bottom of the stack. When a cache set is full and a new block must be inserted into this set, the cache block at the LRU position is evicted, in anticipation that other, more recently accessed, cache blocks will be accessed soon due to temporal locality. Fig. 2.3 depicts the functionality of the LRU cache management technique for three events: insertion, eviction and hit.
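As a concrete illustration of these three events, the following minimal sketch manages a single 4-way set with an explicit recency stack. Real hardware tracks recency with a few bits of per-block state rather than a list, so the code is purely illustrative.

```cpp
#include <cstdint>
#include <deque>
#include <iostream>

// Illustrative sketch only: LRU management of a single 4-way cache set.
// The deque acts as the recency stack: front = MRU, back = LRU.
// access() returns true on a hit.
class LruSet {
    static constexpr size_t kWays = 4;
    std::deque<uint64_t> stack_;                 // block addresses, MRU at the front
public:
    bool access(uint64_t blk) {
        for (auto it = stack_.begin(); it != stack_.end(); ++it) {
            if (*it == blk) {                    // hit: promote to the MRU position
                stack_.erase(it);
                stack_.push_front(blk);
                return true;
            }
        }
        if (stack_.size() == kWays) {            // miss in a full set: evict the LRU block
            std::cout << "evict block " << stack_.back() << "\n";
            stack_.pop_back();
        }
        stack_.push_front(blk);                  // insert the new block at the MRU position
        return false;
    }
};

int main() {
    LruSet set;
    for (uint64_t blk : {1, 2, 3, 4, 1, 5}) {
        bool hit = set.access(blk);
        std::cout << "block " << blk << (hit ? " hit\n" : " miss\n");
    }
}
```

Running the example drives the set through the events of Fig. 2.3: block 1 hits and is promoted to the MRU position, and block 5 then evicts block 2 from the LRU position.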
LRU can effectively exploit temporal locality. However, LRU is not an efficient cache management technique for the LLC as the temporal locality of an application is often filtered by the higher-level caches. As such, not all access patterns observed at the LLC conform to the principle of locality. Prior works listed the three most common access patterns observed at the LLC, which are summarized in Table 2.1 [71, 86].

Figure 2.3: LRU cache management for a 4-way associative cache. Circles labeled $p_i$ show positions of the cache blocks in the recency stack, with position $p_1$ for the Most Recently Used (MRU) cache block and $p_4$ for the Least Recently Used (LRU) cache block. The solid arrows point to the new positions of cache blocks for a given cache event, while the dotted arrows point to the new positions of cache blocks when other blocks are placed into their positions.

Table 2.1: Common cache access patterns at LLC, where $a_i$ denotes an access to a given cache set.

Access Pattern   | Stream of cache accesses $a_i$ to a given cache set
Recency-friendly | $(a_1, a_2, ..., a_{k-1}, a_k, a_k, a_{k-1}, ..., a_2, a_1)^N$, for $k > 0$ and $N > 0$
Streaming        | $(a_1, a_2, ..., a_k)$, for $k > 0$
Thrashing        | $(a_1, a_2, ..., a_k)^N$, for $k >$ associativity and $N > 1$

A recency-friendly access pattern exhibits good temporal locality, as the recently accessed cache blocks are more likely to be accessed soon, making LRU perfectly suitable for such patterns.

A streaming access pattern has no temporal locality in its references. For strictly streaming access patterns, LRU is no worse than any other cache management technique as replacement decisions are irrelevant. However, LRU is inefficient when the LLC observes a mixed access pattern that is a combination of streaming and some other access pattern. Amidst the mixed patterns, LRU inserts all cache blocks at the MRU position. The cache blocks exhibiting streaming accesses are gradually propagated to the LRU position, all the while occupying cache space, until eventually evicted from the LLC without incurring any cache hit, wasting valuable cache capacity. In contrast, the optimal cache management technique may insert all these cache blocks at the LRU position or may bypass their cache insertions altogether and directly forward them to the higher-level caches.

A thrashing access pattern is a cyclic access pattern of length $k$, where $k$ is greater than the associativity of a cache. LRU is inadequate for such an access pattern as it receives zero cache hits. These access patterns present a pathological case for LRU, as LRU tries to retain the entire working set in the cache and ends up with zero hits. In contrast, the optimal cache management technique may retain a partial working set in the cache and may observe cache hits for a fraction of cache accesses.

In practice, applications exhibit access patterns that are some combination of the above access patterns, thus offering significant room for improving cache efficiency over a traditional cache management technique like LRU. To quantify the maximum opportunity in eliminating misses over LRU, we simulate the LLC under Belady's OPT [114], an offline optimal replacement technique that has perfect knowledge of the future. OPT replaces the cache block whose next reference is farthest in the future among the cache blocks in a given set. While OPT is impractical to implement, it provides a theoretical upper bound on the number of misses a cache management technique can eliminate. Fig. 2.4 plots the Misses Per Kilo Instructions (MPKI) for OPT as well as the baseline LRU for all 29 SPEC CPU 2006 applications. OPT is able to eliminate 26% (max 67%) of misses on average over LRU, highlighting a significant opportunity in improving cache efficiency over LRU.

Figure 2.4: Misses Per Kilo Instructions (MPKI) for SPEC CPU 2006 applications under LRU and OPT cache management techniques for a 16-way associative 2MB LLC. Applications on the x-axis are sorted by the MPKI under LRU.

In the following sections, we explain the basics of cache management techniques, followed by a discussion of the most relevant prior cache management techniques.
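To make OPT's victim-selection rule concrete before turning to practical techniques, the following sketch simulates Belady's OPT for a single set, assuming the entire access trace is known in advance. It is meant only to clarify the rule, not as an efficient implementation (a practical simulator precomputes next-use times instead of rescanning the trace).

```cpp
#include <cstdint>
#include <iostream>
#include <unordered_set>
#include <vector>

// Illustrative sketch only: Belady's OPT for a single set of `ways` blocks.
// On a miss in a full set, the victim is the resident block whose next
// reference is farthest in the future (or never referenced again). Requires
// the full trace, which is why OPT is an offline upper bound, not a policy.
size_t opt_misses(const std::vector<uint64_t>& trace, size_t ways) {
    std::unordered_set<uint64_t> resident;
    size_t misses = 0;
    for (size_t i = 0; i < trace.size(); ++i) {
        if (resident.count(trace[i])) continue;                 // hit
        ++misses;
        if (resident.size() == ways) {                          // choose the OPT victim
            uint64_t victim = 0;
            size_t farthest = 0;
            for (uint64_t blk : resident) {
                size_t next = trace.size();                     // "never used again"
                for (size_t j = i + 1; j < trace.size(); ++j)
                    if (trace[j] == blk) { next = j; break; }
                if (next >= farthest) { farthest = next; victim = blk; }
            }
            resident.erase(victim);
        }
        resident.insert(trace[i]);
    }
    return misses;
}

int main() {
    // Thrashing pattern (three blocks cycling through a 2-way set): LRU would
    // miss on every access, whereas OPT retains part of the working set.
    std::vector<uint64_t> trace = {1, 2, 3, 1, 2, 3, 1, 2, 3};
    std::cout << "OPT misses: " << opt_misses(trace, 2) << " of " << trace.size() << "\n";
}
```

On this toy thrashing trace, OPT misses 6 of 9 accesses whereas LRU would miss on all 9, mirroring the gap that Fig. 2.4 quantifies for SPEC CPU 2006; dividing misses by thousands of executed instructions yields the MPKI metric reported in that figure.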
The goal of a cache management technique is to decide which cache blocks to retain in the cache in order to minimize cache misses (or equivalently, maximize cache hits). Therefore, the efficiency of a cache management technique depends on how effectively it answers the following question: Which cache block in a given cache set is the least likely to be accessed soon, and thus should be replaced when a new cache block is inserted in the set?

An offline technique like OPT can provide the optimal answer by looking into the future accesses. However, a practical cache management technique does not know the future LLC accesses, and thus relies on a heuristic that predicts the reuse of cache blocks by analyzing the past LLC accesses.

A typical cache management technique maintains relative priorities of the cache blocks in a given cache set. The priority of a cache block reflects how likely it is to be reused in the near future under a given heuristic. Priorities may be adjusted on certain cache events such as cache hits or cache misses. Overall, every cache management technique implements three policies, each defining how to adjust the priorities of cache blocks for a corresponding cache event.

The insertion policy is responsible for assigning the priority of a new cache block when it is inserted in the cache due to a miss. Meanwhile, the insertion policy may also adjust the priorities of other cache blocks already present in the cache set. In some cases, the insertion policy may choose to bypass the insertion altogether by forwarding data directly to the higher-level caches, if the existing cache blocks in the set are more likely to be reused in comparison to the new cache block. For example, the insertion policy of LRU assumes that the application exhibits a recency-friendly access pattern and thus, a newly inserted cache block is likely to be accessed soon. Based on this assumption, LRU never bypasses the insertion and always assigns the highest priority to a new cache block by inserting it at the MRU position. Before inserting a new cache block, the insertion policy shifts every cache block by one position towards the LRU position in the recency stack, as shown using the dotted arrows in Fig. 2.3.

The eviction policy is responsible for choosing which cache block to replace when the insertion policy decides to insert a new cache block in the cache set and the set is full. If a technique allows multiple cache blocks to have the same priority, the eviction policy also defines a tie-breaker logic. For example, as LRU maintains cache blocks in the recency stack using a total order, no two cache blocks can have the same priority. Thus, the eviction policy of LRU simply chooses the cache block at the LRU position as the replacement candidate.

The hit-promotion policy is responsible for adjusting the priority of a cache block upon a hit. Meanwhile, the hit-promotion policy may also adjust the priorities of other cache blocks already present in the cache set. For example, the hit-promotion policy of LRU assumes that the application exhibits a recency-friendly access pattern and thus, a recently accessed cache block is likely to be accessed soon. Based on this assumption, LRU promotes the cache block to the MRU position, regardless of its current position in the recency stack. Meanwhile, the cache blocks between the MRU position and the position of the cache block before the promotion are shifted one position towards LRU.

There has been a rich history of cache management techniques to improve cache efficiency [8, 10, 18, 37, 39, 40, 53, 54, 59, 63, 67, 69, 71, 73, 76, 78, 80, 81, 82, 85, 86, 87, 88, 89, 91, 97, 99, 100, 103, 107, 110]. Based on the amount of state maintained by the heuristic employed by a cache management technique and how the state is updated, existing cache management techniques can be broadly classified into the following four categories.

Static techniques apply static policies for insertion, eviction and hit-promotion. Such techniques maintain a local state per cache block by augmenting each cache block with a few bits. The local state (e.g., the recency state under LRU) is used to maintain relative priorities of cache blocks within a cache set under some heuristic. The local state of a cache block is only relevant during its current generation, which is defined as the time between insertion and eviction of the cache block; the local state is reset when the cache block is replaced with a new cache block. The static techniques provide fundamental building blocks for more advanced cache management techniques, as discussed next.

Lightweight dynamic techniques apply a dynamic policy for at least one of the three cache events – insertion, eviction or hit-promotion. Such techniques are built on top of static cache management techniques, and thus, like static techniques, maintain a local cache state in the LLC for each cache block. Additionally, these techniques also maintain some state outside the cache, which is referred to as the external state. The external state is usually minimal, and hence the name lightweight.

History-based predictive techniques apply dynamic policies for cache management based on historical access patterns. In addition to the local state for each cache block, these techniques also record information pertaining to the reuse of cache blocks beyond their current generations in some external structure(s). As a result, these techniques require significantly more storage than the lightweight dynamic techniques.

Finally, techniques in the fourth category apply dynamic policies for cache management which rely on software to identify high-reuse cache blocks. For each cache access, software provides some sort of a reuse hint for hardware to make policy decisions.

In the rest of the chapter, we discuss each of these classes in detail.
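The three policies can be viewed as a narrow interface between a cache set and its management heuristic. The sketch below captures that interface (all names are illustrative), together with a trivial random-replacement policy as one possible implementation; the techniques discussed in the following sections differ mainly in how they implement these hooks and in what local or external state they consult.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <iostream>

// Illustrative sketch only: the three policy hooks that every cache management
// technique implements, expressed as an interface.
struct ReplacementPolicy {
    virtual ~ReplacementPolicy() = default;
    // Insertion policy: set the priority of a block newly inserted into `way`
    // of `set`; return false to bypass the insertion altogether.
    virtual bool onInsert(size_t set, size_t way) = 0;
    // Eviction policy: pick the victim way of a full set (the lowest-priority
    // block, with some tie-breaking rule if several ways share that priority).
    virtual size_t selectVictim(size_t set) = 0;
    // Hit-promotion policy: adjust the priority of the block in `way` on a hit.
    virtual void onHit(size_t set, size_t way) = 0;
};

// A trivial implementation: random replacement keeps no priority state at all.
struct RandomPolicy : ReplacementPolicy {
    size_t ways;
    explicit RandomPolicy(size_t w) : ways(w) {}
    bool onInsert(size_t, size_t) override { return true; }              // never bypass
    size_t selectVictim(size_t) override { return std::rand() % ways; }  // arbitrary victim
    void onHit(size_t, size_t) override {}                               // nothing to adjust
};

int main() {
    RandomPolicy policy(16);                      // a 16-way set
    std::cout << "victim way: " << policy.selectVictim(0) << "\n";
}
```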
Static cache management techniques employ static policies for insertion, eviction and hit-promotion, which disregard the reuse of cache blocks in their previous generations. A static technique may maintain a local state for each cache block during its current generation, which is reset when the cache block is evicted and replaced by another cache block.

LRU is a classic example of a static technique: it maintains how recently a cache block was accessed relative to the other cache blocks in a given cache set and makes policy decisions exclusively based on that information. For example, the insertion policy of LRU always assigns a new cache block the highest priority by inserting it at the MRU position, regardless of its reuse in previous generations. Similarly, the hit-promotion policy promotes a cache block to the MRU position on a hit regardless of the number of hits the cache block may have observed in the current or previous generations. Finally, the eviction policy always evicts the cache block at the LRU position, regardless of the number of hits incurred by the cache block in the current or previous generations. Other examples of static techniques include PseudoLRU [110], LIP [86], SRRIP [71], Static GIPPR [54] and Static MDPP [39], among others.

Figure 2.5: A set-associative cache with $n$ ways and $m$ sets. A static technique requires $k$ bits per cache block to maintain the recency state (RS), where $k$ is typically between $1$ and $\log_2 n$. In comparison, $D$ bytes (typically 64 or 128 bytes) are allotted for data, whereas the tag requires $A - \log_2 m - \log_2 D$ bits, where $A$ is the number of bits needed to represent an address. (CS denotes the coherence state.)

Static techniques typically maintain between $1$ and $\log_2 n$ bits of local state (usually a recency state) per cache block, where $n$ is the associativity of the cache. Fig. 2.5 shows a logical organization of the LLC along with the storage devoted to the recency state, tag and data for each cache block. Static techniques typically require the least amount of state per cache block, as other techniques are built on top of a static technique(s).

Static cache management techniques target specific access patterns, and cannot adapt to application behavior due to the static nature of their policies. For example, LRU targets recency-friendly access patterns. However, LRU is not suitable to address thrashing or streaming access patterns, as explained in Sec. 2.2.

Prior work proposed the LRU Insertion Policy (LIP) [86], which makes a simple modification to the insertion policy of LRU to target streaming access patterns. LIP is identical to LRU except for its insertion policy, which assigns a new cache block the lowest priority by inserting it at the LRU position, in anticipation of streaming access patterns. Under LIP management, cache blocks that do not exhibit any reuse are evicted from the cache soon after their insertion, thus minimizing cache pollution for applications dominated by streaming access patterns. However, due to the static nature of its policies, LIP, as a standalone technique, is not suitable for recency-friendly access patterns.
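A sketch of LIP's single change relative to the LRU model shown earlier: insertion goes to the LRU end of the recency stack, while hit-promotion and eviction are untouched (again a purely illustrative single-set model, not a hardware implementation).

```cpp
#include <cstdint>
#include <deque>

// Illustrative sketch only: LIP on a single 4-way set. The only change relative
// to LRU is the insertion policy: a newly inserted block is placed at the LRU
// (back) position instead of the MRU (front) position, so blocks that see no
// reuse are evicted shortly after insertion. Hits still promote to MRU.
class LipSet {
    static constexpr size_t kWays = 4;
    std::deque<uint64_t> stack_;                 // front = MRU, back = LRU
public:
    bool access(uint64_t blk) {
        for (auto it = stack_.begin(); it != stack_.end(); ++it) {
            if (*it == blk) {                    // hit-promotion: unchanged from LRU
                stack_.erase(it);
                stack_.push_front(blk);
                return true;
            }
        }
        if (stack_.size() == kWays) stack_.pop_back();   // eviction: unchanged from LRU
        stack_.push_back(blk);                   // insertion at the LRU position (the LIP change)
        return false;
    }
};

int main() {
    LipSet set;
    // Streaming pattern: each newcomer replaces the previous newcomer at the
    // LRU position, leaving the rest of the set untouched.
    for (uint64_t blk : {1, 2, 3, 4, 5, 6}) set.access(blk);
}
```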
Lightweight dynamic cache management techniques employ a dynamic policy for at least one of the three cache events of insertion, hit-promotion and eviction [54, 69, 71, 80, 82, 86, 89]. A lightweight dynamic technique may maintain some external state, in addition to maintaining a local state per cache block. Policy decisions are influenced by a combination of the local state of the cache blocks in a given set and the external state. Therefore, two cache blocks with an identical recency state may be treated differently based on the external state.

A lightweight dynamic technique is typically constructed by composing a few techniques, each of which is either a static technique or another lightweight dynamic technique. Thus, unlike static techniques, lightweight dynamic techniques can adapt to application behavior.

For example, the Bimodal Insertion Policy (BIP) [86] is composed of two static techniques, LRU and LIP. BIP dynamically selects between LRU and LIP probabilistically, wherein LRU is chosen with a low probability. Thus, BIP inserts a new cache block at the MRU position with a low probability and at the LRU position with a high probability. A new cache block's insertion priority is therefore dynamically decided based on the external state (e.g., a pseudo-random number generator or a saturating counter) at the time of insertion.

BIP is able to target certain thrashing access patterns for which neither of its constituent techniques (i.e., LRU and LIP) alone is suitable. Consider an access pattern to a particular set of the form $(a_1, a_2, ..., a_{k-1}, a_k)^N$ followed by $(b_1, b_2, ..., b_{k-1}, b_k)^N$, where $k$ is greater than the cache associativity and $N$ is greater than 1. For such access patterns, LRU is not suitable for either of the streams and would incur zero hits for both streams. LIP also struggles as it is not able to adapt to the change in the working set from the stream $a_i$ to the stream $b_i$ and would incur zero hits for the second stream of accesses, as all the new cache blocks from the stream $b_i$ are inserted at the LRU position and evicted immediately after their insertion without incurring any hits. In contrast, BIP can adapt to the change in the working set by dynamically switching between LRU and LIP. Under BIP, some cache blocks of the stream $b_i$ are inserted at the MRU position, thus allowing them to persist in the cache longer to incur further hits. Meanwhile, the rest of the cache blocks are inserted at the LRU position, thus reducing cache thrashing.

Another example of a lightweight dynamic technique is the Dynamic Insertion Policy (DIP) [86], which is composed of LRU, a static technique, and BIP, a lightweight dynamic technique. DIP chooses between LRU and BIP based on the observed access pattern, and thus DIP is suitable for applications exhibiting any of the three – recency-friendly, streaming and thrashing – access patterns.

DIP introduced a set-dueling mechanism for the policy selection. DIP allocates a small number of cache sets (called sampler sets) which are exclusively managed under the LRU technique. An equal number of other sampled sets are exclusively managed under the BIP technique. DIP maintains a saturating counter (i.e., an external state) outside the cache to track the difference in misses due to each technique. DIP dynamically selects the technique that causes fewer misses, and manages the rest of the sets (called follower sets) using the most effective technique for a given access pattern.

Figure 2.6: A dynamic technique composed of two techniques, A and B. A small number of sampled sets (sampler sets) implement technique A and an equal number of other sets implement technique B; misses in these sets update a saturating counter. The remaining sets (follower sets) implement the winning technique based on the value of the saturating counter.

RRIP is the state-of-the-art lightweight dynamic technique [71]. RRIP is fundamentally very similar to DIP. However, RRIP is practically more attractive than DIP as RRIP does not rely on LRU as its base static technique. RRIP maintains the cache blocks in a given set in only $k$ unique recency classes ($k$ is typically smaller than the associativity $n$), thus requiring $\log_2 k$ bits per cache block. In comparison, LRU maintains the cache blocks in a set in total order ($n$ unique recency classes), which requires $\log_2 n$ bits per cache block.

As a lightweight dynamic technique is composed of static techniques, it also maintains a local state per cache block as required by the base static techniques. Finally, a lightweight dynamic technique also maintains an external state that guides the dynamic policy selection. For example, the set-dueling mechanism of DIP requires a saturating counter to keep track of the winning policy between LRU and BIP, as shown in Fig. 2.6.
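A minimal sketch of set-dueling as used by DIP follows: a handful of leader sets are pinned to each constituent policy, and a single saturating counter records which one misses less. The leader-set mapping and counter width below are illustrative assumptions, not DIP's exact parameters.

```cpp
#include <cstdint>
#include <iostream>

// Illustrative sketch only: DIP-style set-dueling between two insertion
// policies. A few leader sets are statically dedicated to policy A (LRU
// insertion) and an equal number to policy B (BIP insertion); misses in leader
// sets nudge a saturating counter (PSEL), and all remaining follower sets use
// whichever policy the counter currently favors. All constants are assumptions.
class SetDueling {
    static constexpr int kPselMax = 1023;        // 10-bit saturating counter (assumed width)
    int psel_ = kPselMax / 2;
public:
    // Simplified leader-set mapping: sets 0..31 lead policy A, 32..63 lead policy B.
    static bool leadsA(uint32_t set) { return set < 32; }
    static bool leadsB(uint32_t set) { return set >= 32 && set < 64; }

    void recordMiss(uint32_t set) {              // called on every LLC miss
        if (leadsA(set) && psel_ < kPselMax) ++psel_;    // policy A missed: favor B
        else if (leadsB(set) && psel_ > 0)     --psel_;  // policy B missed: favor A
    }

    // Insertion decision for a follower set: true => policy A (LRU insertion),
    // false => policy B (BIP insertion).
    bool followerUsesPolicyA() const { return psel_ < kPselMax / 2; }
};

int main() {
    SetDueling duel;
    for (uint32_t set = 0; set < 32; ++set) duel.recordMiss(set);  // policy A's leader sets miss a lot
    std::cout << (duel.followerUsesPolicyA() ? "A" : "B") << " wins\n";  // prints: B wins
}
```

Because a few dozen sampled sets track whole-cache miss behavior closely, the only external state needed is this single counter, which is what keeps such techniques lightweight.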
A lightweight technique can adapt to application behavior and dynamically select the policy best suited for the application at a given time. However, due to its minimal external state, a lightweight technique cannot provide fine-grain cache management for different streams when each stream exhibits diverging access patterns. Consider an example of two streams $a_i$ and $b_i$, wherein $a_i$ exhibits a streaming access pattern and $b_i$ exhibits a recency-friendly access pattern. Also assume that the accesses from both streams are interleaved. A lightweight technique may apply a policy that is suitable for the access pattern that dominates the cache misses (e.g., apply LIP for both streams if $a_i$ dominates, or apply LRU for both streams if $b_i$ dominates). Consequently, a lightweight technique is unable to manage individual streams. In contrast, the optimal technique may apply a policy individually for each stream (e.g., by managing $a_i$ under LIP and $b_i$ under LRU), showing significant opportunity in improving cache efficiency by applying fine-grain cache management to individual streams.

History-based predictive techniques implement dynamic policies that identify dead blocks (or conversely, useful blocks) based on historical access patterns [8, 18, 37, 39, 40, 67, 73, 81, 85, 97]. These techniques encode reuse information of cache blocks beyond their current generations in some external structure, for subsequent recall when the cache blocks are accessed again. The external state maintained by these techniques is often non-trivial, unlike that of lightweight dynamic techniques.

The majority of history-based techniques encode reuse information in an external structure called a history table. To avoid the prohibitive storage costs of tracking individual cache blocks, these techniques use a single entry in the history table to encode reuse information for a set of cache blocks that are likely to exhibit homogeneous reuse. For example, prior works have used different correlating features such as the sequence of memory access instruction addresses (PCs) leading to a block's access, the single PC accessing a cache block, and the starting address of a fixed-size memory region containing a cache block [67, 73, 102, 104].

History-based techniques can provide fine-grain cache management by adapting their policies for individual access streams, as we explain below using the example of three state-of-the-art history-based predictive techniques.

SHiP [67] leverages PC-correlated reuse behavior by adapting its policies at per-PC granularity. Each PC is classified as a Streaming-PC or a Reuse-PC. If cache blocks inserted by a particular PC are evicted without incurring any reuse, the PC is classified as a Streaming-PC. Any other PC is classified as a Reuse-PC. For Streaming-PCs, SHiP applies a policy suitable for streaming access patterns. For Reuse-PCs, SHiP applies a policy suitable for recency-friendly access patterns.
Sampling Dead Block Predictor (SDBP) [73] leverages PC-correlating reuse behavior by aiming to detect the last access to a cache block, i.e., the instance at which a cache block becomes dead. Each PC is classified as Last-PC or Not-a-Last-PC. If cache blocks accessed by a particular PC are evicted without incurring a further reuse, the PC is classified as Last-PC. All other PCs are classified as Not-a-Last-PC. A cache block accessed by any Last-PC is predicted dead and its priority is set to the lowest to make it the immediate candidate for eviction; if a cache access by a Last-PC leads to a cache miss, the corresponding cache insertion may be bypassed by forwarding data directly to the higher-level caches. Meanwhile, cache blocks accessed by Not-a-Last-PCs are managed under a simple static cache management technique.
Hawkeye [37] is the state-of-the-art technique that relies on PC-correlating reuse behavior. Hawkeye simulates Belady's OPT [114] on past cache accesses and, based on the policy decisions taken by OPT, it classifies each PC as cache-averse or cache-friendly. Cache blocks accessed by PCs tagged as cache-averse are made the immediate candidates for eviction. Meanwhile, other cache blocks are managed under a simple static cache management technique.

All three history-based techniques discussed above exploit some form of PC-correlating reuse, which is one of the most commonly used correlating features among prior history-based techniques. We note that SHiP also proposed leveraging the memory region as another correlating feature, in which SHiP adapts its policies at per-region granularity and all cache blocks belonging to the same memory region are managed under the same policy.
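For intuition only, the sketch below shows one plausible shape of the PC-correlated learning that these techniques share: a table of saturating counters indexed by a hash of the PC, trained when blocks are evicted and consulted when blocks are inserted. The table size, counter width, hash function, and classification threshold are assumptions of this sketch and do not reproduce the exact mechanisms of SHiP, SDBP, or Hawkeye.

#include <array>
#include <cstddef>
#include <cstdint>

// Generic PC-indexed reuse predictor in the spirit of the techniques above.
class PCReusePredictor {
    static constexpr std::size_t kEntries = 16384;     // 16K-entry table (assumed)
    std::array<uint8_t, kEntries> counters{};          // 2-bit saturating counters

    static std::size_t index(uint64_t pc) { return (pc ^ (pc >> 14)) % kEntries; }

public:
    // Training: called when a block inserted by `pc` leaves the cache.
    void train(uint64_t pc, bool block_was_reused) {
        uint8_t &c = counters[index(pc)];
        if (block_was_reused) { if (c < 3) ++c; }       // reinforce "Reuse-PC"
        else                  { if (c > 0) --c; }       // reinforce "Streaming-PC"
    }

    // Prediction: consulted on insertion to choose an insertion priority.
    bool predicts_reuse(uint64_t pc) const { return counters[index(pc)] >= 2; }
};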
Local state:
As history-based predictive techniques are typically built on top of simple static or lightweight dynamic techniques, these also maintain a local state per cache block as required by the base technique.

Figure 2.7: A history-based predictive technique employs a history table that encodes reuse information of cache blocks. The history table is updated with the reuse information observed for cache blocks (potentially, for only cache blocks from the sampler sets). The history table is queried to make a reuse prediction for a cache block from any set.
Reuse information:
The cache blocks may be augmented with additional state needed to encode reuse information, using which the history table is trained. To reduce the need to update the history table frequently, only cache blocks belonging to a small number of preconfigured cache sets (called sampler sets) may be used to train the history table. Thus, only the cache blocks that belong to the sampler sets require additional storage.
Embedded prediction metadata:
Some history-based techniques may also embed prediction metadata in every cache block. Prediction metadata is updated for a cache block on insertion and, potentially, on every subsequent hit. For example, SDBP uses 1-bit per cache block to indicate if a cache block is predicted dead, which is updated on every access to the cache block.
External prediction metadata:
History-based techniques encode prediction metadata in some external structure, usually the history table, as shown in Fig. 2.7. For example, prior techniques employ history tables with tens of KBs of storage per core for a 1MB LLC [37, 67, 73].
PC-based reuse correlation:
A large fraction of history-based techniques rely on PC-based correlation to make reuse predictions, as code-data correlation generally enables higher accuracy predictions than other features [8, 18, 37, 39, 40, 67, 73, 81]. Indeed, all seven techniques [13, 14, 17, 19, 24, 25, 27] presented at the Cache Replacement Championship'17 [20] rely on some form of a PC-based reuse correlation. Therefore, these techniques need to pass PCs through the load-store queue and all the levels of a cache hierarchy, requiring extra logic, wiring and energy consumption. This is partially mitigated by storing only a hash of a PC, which requires only a fraction of bits compared to the whole PC (e.g., 14 bits for a PC hash vs 48 bits for a full PC address). Nevertheless, it still poses a significant challenge for commercial processors to implement PC-based techniques [21].

Reuse correlating features:
History-based techniques use correlating features (e.g., PC-based reuse correlation) to reduce the storage cost of history tables. Use of a correlating feature also helps train the history table faster, when the reuse behavior for all cache blocks mapped to the same history table entry is similar. However, when the reuse behavior diverges for the cache blocks mapped to the same entry, it leads to a pathological case for prediction. Consider an example where a non-trivial fraction of cache blocks accessed by a PC exhibit high reuse, but the rest of the cache blocks accessed by the same PC exhibit no reuse. In such a case, a history-based technique that relies on PC-correlating reuse may struggle in reliably distinguishing the high-reuse cache blocks from the no-reuse cache blocks.
History table look-ups:
SHiP relies on history-based predictions for only its insertion policy. For SHiP, every new cache block is inserted in the cache after querying the history table. In contrast, SDBP and Hawkeye rely on history-based predictions for insertion as well as the hit-promotion policy. Thus, the history table is queried on all cache accesses (including cache hits), which puts history table look-ups on the critical path as a look-up may increase the latency of a cache hit. Such critical path look-ups are even more undesirable in a modern multi-core processor with a NUCA LLC (as shown in Fig. 2.2), as each LLC hit under these techniques requires accessing the PC-indexed history table that might be located elsewhere on a chip, incurring latency, energy, and traffic overheads due to the need to traverse the on-chip network.
Software-aided cache management techniques rely on software hints to identify which cache blocks are likely to exhibit high reuse [10, 53, 91, 99, 100, 107]. For cache management, these techniques typically rely on a lightweight dynamic technique. However, the policy selection is guided by the software, unlike lightweight dynamic techniques that rely on some hardware mechanism such as set-dueling.

For example, Pacman [53] communicates a 1-bit hint with every memory access to guide whether a cache block should be inserted at the MRU position or the LRU position. Pacman optimizes loop code using runtime profiling over multiple training runs as follows. During training, Pacman analyzes access patterns of memory addresses within a loop that are dependent on the loop index variable and attempts to find a correlation between the loop index and the reuse distance of the memory accesses. If it finds a linear correlation, the loop is split into two, with all memory accesses in one loop tagged with a non-temporal hint (e.g., LRU hint) and all memory accesses from the other loop tagged with a temporal hint (e.g., MRU hint); a sketch of this loop splitting is shown below. Overall, the cache management under Pacman is very similar to that of DIP, except that Pacman relies on software to select between the LRU or MRU insertion position whereas DIP relies on a lightweight hardware mechanism.

XMem [10], a recently proposed software-aided technique, relies on pinning-based cache management for applications that benefit from cache tiling. Pinned cache blocks are protected from eviction until explicitly unpinned by the software, usually done when the tile is fully processed. XMem dedicates 75% of LLC capacity for pinning cache blocks that belong to the tile whereas the remaining capacity is managed by some other hardware-only cache management technique.
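The sketch below illustrates the Pacman-style loop splitting mentioned above, under stated assumptions: the hinted-load helpers stand in for whatever ISA-level 1-bit reuse hint the hardware would expose, and the split point is assumed to come from Pacman's offline profiling. Neither Pacman's actual interface nor its profiling pass is reproduced here.

#include <cstddef>

// Stand-ins for hinted loads (hypothetical; a real implementation would emit
// load instructions carrying the 1-bit reuse hint).
static inline double load_temporal(const double *addr)     { return *addr; }
static inline double load_non_temporal(const double *addr) { return *addr; }

double process(const double *data, std::size_t n, std::size_t split) {
    double sum = 0.0;
    // Accesses profiled to have short reuse distances keep the temporal hint
    // (MRU insertion)...
    for (std::size_t i = 0; i < split; ++i) sum += load_temporal(&data[i]);
    // ...while accesses whose reuse distance exceeds the cache capacity are
    // tagged non-temporal (LRU insertion) so they do not displace useful blocks.
    for (std::size_t i = split; i < n; ++i) sum += load_non_temporal(&data[i]);
    return sum;
}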
As software-aided techniques are typically built on top of lightweight dynamic techniques, these also maintain a local state per cache block. Additionally, these techniques require nominal additional state, if any. For example, Pacman does not require any additional state whereas XMem requires 1-bit per cache block to identify whether a cache block is pinned.
Custom interface:
Software-aided techniques, unlike other techniques discussed so far, are not completely transparent to the software, and thus require additional hardware support for software to communicate hints. For example, Pacman proposed changes in the Instruction Set Architecture (ISA) by embedding load/store instructions with a 1-bit reuse hint to guide cache management policies. Meanwhile, XMem proposed a region-based interface as follows: XMem supports custom cache management for n different memory regions. For each memory region, XMem hardware exposes a pair of registers, with each pair required to be populated by software with the bounds of the region of interest. Software also sets the reuse hint for each memory region it populates to indicate whether the cache blocks from a given region should be pinned.

Limited scope:
The majority of prior software-aided techniques rely on compiler analysis and/or runtime profiling to provide software hints. For example, Pacman only optimizes loops with regular access patterns, and thus may not be effective for applications dominated by irregular access patterns (e.g., indirect memory accesses of graph analytics), making such techniques difficult to apply for a broad spectrum of applications.
In this chapter, we provided background on cache management techniques necessary to understand our contributions in the following chapters. We also provided a broad classification of existing cache management techniques depending on the state needed by their heuristics, which is summarized in Table 2.2.

Static techniques require the least amount of state, a few bits per cache block, among all classes of techniques. While standalone static techniques are the least effective in addressing complex access patterns at LLC, these techniques serve as building blocks for more advanced dynamic techniques.

Lightweight dynamic techniques build on top of static techniques and require nominal additional state. These techniques provide significant value-addition over static techniques by dynamically adapting to the observed access patterns. However, due to limited state, lightweight techniques are unable to provide fine-grain cache management for individual access-streams.

History-based predictive techniques are the state-of-the-art in cache management that provide fine-grain cache management by adapting their policies according to the access patterns of individual access-streams. However, these techniques require non-trivial storage to maintain state in external structure(s), whose accesses may fall on the critical path of cache accesses.

Software-aided techniques can provide more accurate identification of high-reuse cache blocks as opposed to the hardware-only techniques for some applications.
Technique            | State Within Cache                                                | External State   | Software Support?
Static               | Recency State                                                     | -                | -
Lightweight Dynamic  | Recency State                                                     | Nominal          | -
History-based        | Recency State + Reuse Information + Embedded Prediction Metadata | History Table(s) | -
Software-aided       | Recency State                                                     | -                | ISA Extension

Table 2.2: Overview of state required for various classes of cache management techniques.
However, these techniques may require changes in the existing ISA. Finally, existing proposals target a set of applications with specific properties (e.g., tile-based algorithms or loops with regular access patterns).

Overall, history-based techniques and software-aided techniques generally manage LLC more efficiently than static or lightweight dynamic techniques. Unsurprisingly, to provide higher efficiency, these techniques also require more hardware (e.g., history table or new ISA extensions). However, the cost of additional hardware is usually insignificant in comparison to the LLC. For example, the storage requirement of a history table is less than 2% of the LLC for the state-of-the-art history-based techniques [37, 67, 73].

Chapter 3
Leeway – Domain-Agnostic Cache Management
History-based predictive techniques (also known as
Dead Block Predictors or DBPs) have been shown to be effective in improving LLC efficiency through better utilization of existing capacity [37, 39, 40, 67, 73, 81]. These schemes all rely on some metric of temporal reuse to make their decisions regarding the end of a given block's useful life. Previous works have suggested hit count [81], last-touch PC [73], and the number of references to the block's set since the last reference [59], among others, as metrics for determining whether the block is dead at a given point in time. By identifying and evicting dead blocks in a timely and accurate manner, these schemes allow other blocks (that have not exhausted their useful life) to persist in the cache and see further hits.

The task of a DBP is complicated by the fact that applications often exhibit variability in the reuse behavior of cache blocks. The sources of variability are numerous, stemming from microarchitectural noise (e.g., speculation), control-flow variation, cache pressure from other co-running applications, etc. The variability manifests itself as an inconsistent behavior of the individual cache blocks from one cache generation (from allocation to eviction) to the next. This inconsistency challenges DBPs in reliably identifying the end of a block's useful lifetime, thus resulting in lower prediction accuracy, coverage, or both.

A DBP requires metrics and policies that can tolerate inconsistencies. To that end, we propose Live Distance, a new metric of temporal reuse based on Stack Distance. Stack distance for a cache reference to a given cache block is defined as the number of
unique cache blocks accessed since the previous reference to the cache block [112]. For a given generation of a cache block, live distance is then defined as the largest observed stack distance in the generation. Live distance is an efficient way to represent a block's range of temporal use and, as we argue in Sec. 3.2.3, has a number of useful properties that make it attractive for dead block prediction in the face of variability.

We introduce Leeway, a new DBP that uses live distance as a metric for prediction. Leeway uses code-data correlation to associate live distance for a group of blocks with a PC that brings the block into the cache. While live distance as a metric provides a high degree of resilience to variability by conservatively capturing a block's temporal reuse, the per-PC live distance values themselves may fluctuate across generations. To correctly train live distance values in the face of fluctuation, we observe that individual applications' cache behavior tends to fall in one of two categories: streaming (most allocated blocks see no hits) and reuse (most allocated blocks see one or more hits). Based on this simple insight, we design a pair of corresponding policies that steer updates in live distance values either toward zero (for bypassing) or toward the maximum recently-observed value (to maximize reuse). For each application, Leeway dynamically picks the best policy based on the observed reuse behavior at LLC.

To avoid the need to access specialized external structures (e.g., predictor or history table) upon each LLC access, Leeway embeds its prediction metadata (i.e., live distance) directly with cache blocks. This is in contrast with prior predictors [37, 39, 40, 73], which need to access a dedicated history table upon every single LLC access. Because modern multi-core processors feature a distributed NUCA LLC, accesses to dedicated history tables introduce detrimental latency and energy overheads in traversing the on-chip interconnect to query such structures.

We study cache management techniques on various deployment configurations, and make the following contributions:

• We propose Leeway, a dead block predictor for LLC that introduces a new metric, Live Distance, to track a block's useful lifetime in the cache. To provide high performance in the face of variability, Leeway deploys novel reuse-aware update policies that steer live distance values to maximize either bypass or reuse opportunities based on the application preference.

• Leeway embeds prediction metadata in the cache, and thus accesses the history table only on misses, keeping the table look-ups off the critical path. This is in contrast to prior DBPs that access history tables on all cache accesses (including cache hits).

• We compare Leeway to prior cache management techniques for LLC, demonstrating that Leeway consistently provides good performance that generally matches or exceeds that of state-of-the-art approaches.

DBPs aim to improve cache behavior by identifying dead blocks and discarding them shortly after their last use, thereby providing an opportunity for blocks with long temporal reuse distances to persist. Effectiveness of a dead block prediction hinges on the stability of application behavior with respect to the metric used for determining whether the block is dead.
Naturally, the more consistent the reuse behavior across the block's generations in the cache, the more accurate the predictions. In practice, there are many reasons why a block's live time may vary across generations, including:
Control flow variation:
When the memory reference instruction is predicated on a condition whose behavior varies at runtime, the corresponding cache block might be referenced a different number of times across generations based on the predicate.
Microarchitectural noise:
This includes references on a mispredicted control flow path and hits in lower-level caches due to conflicts in higher-level caches.
Shared data:
When a block is shared by multiple threads, it might see different reference patterns due to runtime dynamics and scheduler decisions.
Cache pressure:
An application's behavior may be consistent, but due to cache pressure in the presence of co-running applications, a block may be prematurely evicted. As a result, the block would observe fewer references in a prematurely terminated generation than it would otherwise.
Application characteristics:
An application may inherently exhibit irregular behavior, leading to inconsistent access patterns for cache blocks. For example, for graph processing applications, reuse patterns of accesses to vertices are dependent on the graph topology. Specifically, the number of times a vertex is accessed depends on the number of edges connected to the vertex, and the reuse distance of an access depends on the number of other vertices and edges accessed since the previous access to the same vertex.

PC_i: Ld X
. . .
PC_v: Beq cond, SKIP
PC_w: Ld X
SKIP:

Listing 3.1: A code snippet showing potential variability in the reuse behavior of reference X due to a data-dependent branch.

Our insight is that the ability of a DBP to tolerate inconsistency across generations hinges on the choice of the metric used for making the predictions. Spurred by this observation, we next use a simple taxonomy to understand the space of metrics.
Fundamentally, all DBPs require a metric for determining when a block has reached the end of its useful life. Existing metrics can be classified broadly into two categories: direct and indirect.

Direct metrics:
Also known as event-based metrics, these rely on monitoring accesses to the block in order to detect the final access based on previously observed behavior. Reference count [81], trace signature of instructions referencing a block [102, 104], and last-touch PC [73] are all examples of direct metrics used by previously proposed DBPs. An advantage of direct metrics is that a block's fate is determined exclusively by accesses to itself, thereby shielding the decision-making mechanism from noise due to accesses to other blocks.

The downside of direct metrics is their inflexibility in the face of inconsistent behavior, which we define as any variation from one generation of a block to the next. Consider the simple code snippet shown in Listing 3.1, which shows a reference to a cache block holding the variable X, followed by a predicated second reference to X. Assuming that the second reference occurs only a fraction of the time due to the data-dependent nature of the predicate, predictors that rely on direct metrics are faced with two choices: (1) predict the block dead after the first reference, incurring a miss if the predicate resolves to False; or (2) predict the block dead after the second reference, which may never occur if the predicate resolves to True, and thus the prediction is never made. Alas, none of the options is satisfying, as they reduce either accuracy or coverage of the predictions.

Fig. 3.1 demonstrates such behavior for the last-PC metric used by SDBP [73] in h264ref, one of the SPEC CPU 2006 applications, for a PC responsible for 37% of the misses. The behavior captured in the figure is representative of the entire execution; for clarity, however, the figure shows only a sample of 250 consecutive cache references by that PC (X axis). For each reference, the Y axis shows whether the reference is, indeed, the last access to the block or not under the LRU cache management technique. For the last-PC metric to be useful in identifying dead blocks upon a last access to them, this behavior should be consistent, with all points falling on either the Last-Access (indicating dead blocks) or Not-a-Last-Access (indicating live blocks) line. Meanwhile, the fluctuation shown in the figure indicates that a predictor using the last-PC metric may struggle in accurately determining the end of a useful lifetime for blocks touched by this PC.

Figure 3.1: Variability for a PC being the last touch or not in h264ref.
Indirect metrics:
Also known as age-based metrics, these rely on an external reference signal to inform the prediction mechanism of the block's age. A block's age increases with some notion of time, which is reset upon a hit. The age can be computed in the number of cycles [97], number of accesses to the cache [85], or number of accesses to the set [59, 81]. When a block's age crosses a set threshold (e.g., the maximum observed age from the previous generations), the block may be predicted dead.

A major advantage of indirect metrics is their inherent ability to tolerate uncertainty in a block's behavior. Coming back to the code snippet in Listing 3.1, a carefully chosen age threshold may allow the block to stay in the cache long enough to see the second hit, if any, while ensuring that the block won't greatly overstay its likely useful lifetime.

The drawback of existing indirect metrics is their imprecision and susceptibility to noise. Because the prediction is made based on events unrelated to the block itself (e.g., the count of all cache accesses), the age used for deciding whether the block is dead must have some tolerance to fluctuation built into it. This tolerance inevitably increases the block's dead time, even for highly predictable blocks, potentially causing the block to stay in the cache long after its last access while waiting for the age to reach the conservatively set threshold.

Figure 3.2: Stack Distances for one PC in GemsFDTD for a 16-way set-associative cache. For a cache hit, a stack distance ranges from 1–16. A cache block that is evicted with zero hits is shown to have a stack distance of 0.
Stack distance for a reference to a given cache block is defined as the number of unique cache blocks accessed since the previous reference to the cache block [112]. Stack distance provides a useful way to reason about a block's reuse behavior: blocks that have short reuse intervals will have short stack distances, while blocks with long reuse intervals will see larger stack distances over their lifetime in the cache. In practice, a short stack distance means that a block is likely to experience a hit when it is near the top of the LRU stack (i.e., close to the MRU position). Conversely, a long stack distance means that a hit may come near the LRU position, or, if the stack distance exceeds the associativity of the cache, will result in a miss to the block. By predicting dead blocks early, DBPs aim to keep blocks with long stack distances in the cache long enough for them to see a hit.

We make the observation that stack distance can be turned into a powerful metric for dead block prediction. Fig. 3.2 provides the intuition. The figure shows the observed stack distances for a sample of 250 cache references for all blocks allocated by a single PC which is responsible for the highest number of LLC misses in
GemsFDTD. The key take-away is that despite significant variability across references, the stack distance is largely confined to 5.

Ref                    | Outcome for the final reference to X
X A X                  | Hit
X A B X                | Hit
X A A A B B B A X      | Hit
X F X                  | Hit
X A B C P Q R S T X    | ∞ (> 8), Miss

Table 3.1: Stack Distance & Live Distance for block X in an 8-way set for the reference pattern X A X A B X A A A B B B A X F X A B C P Q R S T X. Assuming the LRU policy, X incurs 4 cache hits in a generation that starts with a cache fill of the first instance of X in Ref.

Based on this insight, we define
Live Distance as the maximum observed stack distance during a block's generation (from insertion to eviction). Live distance is a good indicator of the block's temporal reuse limit, so when the block's position within the LRU stack exceeds its known live distance, the block is unlikely to be referenced again and can be predicted dead. To obtain stack distance values, we exploit the fact that LRU-based policies implicitly track stack distances of cache-resident blocks. In true LRU, when a block hits, its current LRU stack position corresponds to its stack distance. For policies that deviate from true LRU, such as multi-bit NRU (see Sec. 3.3.3 for details), a block's stack position upon a hit only approximates the true stack distance. Nevertheless, it provides an efficient heuristic to approximate stack distance and, correspondingly, live distance.

Table 3.1 demonstrates how stack and live distance are determined for a block X for various reference patterns in an 8-way set. In this example, the largest observed stack distance is 3, yielding a live distance of 3 and indicating that X can be predicted dead after the reference to C in the final reference pattern.

Live distance conservatively captures a block's observed range of temporal reuse within the LRU stack. Because of this combination, live distance can naturally tolerate variability across generations as long as the reuse interval for the block falls within the previously observed range. At the same time, live distance provides an efficient mechanism for rapidly identifying blocks that have exceeded their typical reuse window and can therefore be predicted dead.

Compared to other indirect metrics, live distance has an additional attractive property. By relying on stack distance, which only grows as a result of hits to unique blocks, live distance provides a degree of dampening to noise resulting from variability in access patterns to recently-accessed blocks. Because the most recently accessed blocks are the ones likely to receive future hits, suppressing variability in these hit counts is beneficial [84].
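As a trace-level illustration of these definitions, the sketch below walks an LRU stack over a reference trace and records, per block, the largest stack position observed on a hit. The 1-based position convention (1 = MRU) and the software framing are assumptions made for illustration; they approximate, rather than reproduce, Leeway's in-cache mechanism described next.

#include <algorithm>
#include <cstdint>
#include <list>
#include <unordered_map>
#include <vector>

// Derive per-block live distances (max LRU stack position on a hit) from a trace.
std::unordered_map<uint64_t, int> live_distances(const std::vector<uint64_t> &trace,
                                                 int associativity) {
    std::list<uint64_t> stack;                        // front = MRU, back = LRU
    std::unordered_map<uint64_t, int> live_distance;  // per-block maximum
    for (uint64_t block : trace) {
        int position = 0;
        auto it = stack.begin();
        for (; it != stack.end(); ++it) { ++position; if (*it == block) break; }
        if (it != stack.end()) {
            // Hit: the block's current stack position approximates its stack
            // distance; the live distance is the maximum seen so far.
            stack.erase(it);
            live_distance[block] = std::max(live_distance[block], position);
        }
        stack.push_front(block);                       // (re)insert at MRU
        if ((int)stack.size() > associativity) stack.pop_back();  // evict the LRU block
    }
    return live_distance;
}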
We introduce Leeway, a history-based predictive cache management technique that uses live distance as its underlying metric. We first explain the Leeway basics and features that make it robust against variability in the context of LLC. We then show how Leeway works with a low-cost 2-bit NRU cache management technique. We then discuss microarchitectural details and compare its cost and complexity with prior techniques. Later, we extend Leeway to a multi-core setup.

LRU-based Leeway uses a full LRU stack and records the maximum observed hit position (i.e., live distance) during a block's residency in the cache. At eviction time, the live distance is recorded in a separate structure, the Live Distance Predictor Table (LDPT), for subsequent recall when the block is allocated again. Leeway uses the live distance learned in the block's previous generations to infer when the block may have exceeded its useful lifetime and predicts it dead. To avoid the prohibitive storage costs of tracking individual cache blocks in the LDPT, Leeway exploits code-data correlation and associates all cache blocks allocated by the same PC with one PC-indexed LDPT entry.

The functionality of Leeway can be divided into three categories – Learning, Prediction and Update. Learning is a continuous process for cache-resident blocks that involves checking a block's position in the LRU stack upon each hit and, if the current position exceeds the past maximum, updating the live distance. Prediction is triggered during victim selection on a miss to a set. Any block that has moved past its predicted live distance in the LRU stack is predicted dead. Update occurs upon a block's eviction from the cache, propagating the latest live distance to the LDPT. To effectively handle variability in live distance across generations of a given block and across blocks tracked by a single PC-indexed LDPT entry, the update process is conditional as explained in the next section.

Leeway implements set-sampling, similar to [67], to learn the blocks' live distances by observing their behavior in a small number of sampler sets. Sampling significantly reduces Leeway's storage requirement as only the blocks belonging to the sampler sets need to be augmented with storage needed for learning.
As explained in Sec. 3.2.1, a block's observed reuse behavior may fluctuate in time even if its fundamental reuse characteristics are not changing. While the live distance metric provides a degree of protection from intra-generation noise, Leeway must contend with inevitable fluctuation in live distance across generations and across different blocks allocated by the same PC. In particular, it must separate unrepresentative live distance values from actual shifts in the reuse behavior. This observation points to the need for an intelligent update policy for Leeway's live distance values.

To design a variability-tolerant update policy, we study SPEC CPU 2006 applications to understand their reuse behavior. Our analysis reveals that applications tend to fall in one of two categories in terms of their reuse behavior affecting LLC management.

The first category is dominated by streaming cache blocks that do not observe any LLC hits and should be bypassed. For example, in mcf, over 90% of cache blocks are not reused after allocation in LLC under LRU. In many cases, however, we find that blocks allocated by certain streaming PCs will occasionally observe one or more hits. Fig. 3.3 shows one such PC responsible for 21% of the misses in mcf. Moreover, such behavior sometimes occurs in clusters, forcing a shift in cache management policy from bypassing to keeping blocks on chip. Such a shift is generally undesirable, as the behavior tends to quickly revert back to streaming. A multi-bit hysteresis threshold may be effective in delaying a shift in policy; however, the high threshold is counter-productive when the behavior reverts back to streaming, as it will lead to blocks being allocated in LLC rather than bypassed.

Figure 3.3: Variability in live distance with a bias of streaming for a PC in mcf. A Live Distance of 0 indicates a bypass opportunity.

Figure 3.4: Variability in live distance with a bias of reuse for a PC in calculix.

The second category of applications is dominated by blocks that do see reuse prior to being evicted from the LLC. For example, in calculix, more than 60% of blocks are reused at least once after their allocation in LLC under LRU. We observe considerable variability in live distance for many PCs that allocate blocks exhibiting reuse. Fig. 3.4 shows one such PC responsible for 29% of the misses in calculix. This observation is consistent with our work that observed that the blocks exhibiting reuse are more prone to variability in inter-generational behavior than the streaming blocks, thus posing a challenge for DBPs [32]. Given the uncertainty in the reuse behavior, such blocks should be kept longer to maximize the opportunity for reuse.

The two types of behavior naturally lead to a pair of policies designed to maximize bypass opportunities for streaming applications and reuse opportunities for others.

Bypass-Oriented Policy:
This policy seeks to maximize opportunities for bypass by being slow to increase the live distance and fast in dropping it back towards 0, in the face of variability in live distance values. An incoming block with a predicted live distance of 0 is bypassed, unless it maps to a sampler set (see Sec. 3.3.4.2 for details).
Reuse-Oriented Policy:
To maximize reuse opportunities for allocated blocks when there is a fluctuation in live distance values, this policy is quick to increase the live distance and slow to decrease it. Since Leeway does not evict blocks that have not reached their live distance value in the LRU or multi-bit NRU stack, a larger live distance enables a longer temporal window for a block to uncover reuse.
Enabling the policies:
The two policies call for diametrically opposite behavior: whereas the Bypass-Oriented policy is slow to increase the live distance values in LDPT but fast to decrease them, the Reuse-Oriented policy is fast to increase live distance values but slow to decrease them. To satisfy the demand for separate policies in increasing and decreasing live distance in the LDPT, Leeway deploys two Variability Tolerance Thresholds (VTTs) that control the rate at which live distance values are adjusted based on workload behavior and the direction of change in live distance.

In order to choose the preferred policy for a running application, Leeway leverages Set-Dueling [86] and implements both policies (Bypass- and Reuse-Oriented) simultaneously on separate sampler sets. The rest of the cache follows the policy that minimizes the misses.

So far, we have considered Leeway on top of true LRU, which may be unattractive for highly-associative caches. In this section, we explain the minimal modifications required to make Leeway work with a low-cost multi-bit
Not Recently Used (NRU) family of techniques.

NRU uses 1-bit per cache block to keep track of blocks that have not been used recently with respect to some time frame in the past. Multi-bit NRU is an extension of NRU that uses two or more bits per cache block to indicate a partial relative order of LRU stack positions. For instance, a 2-bit NRU policy keeps blocks in a set in one of four equivalence classes as a function of their relative stack positions, with class 1 for MRU blocks and class 4 for LRU ones. During victim selection, a block in class 4 is evicted (ties are broken through random selection). If no block is found in class 4, every block is moved to the next class and the process is repeated. Both RRIP [71] and SHiP [67] use 2-bit NRU.

Figure 3.5: Schematic of Leeway for LLC.

The Leeway implementation over (1-bit or multi-bit) NRU,
Leeway-NRU, relies on the partial relative order maintained by NRU to make dead block predictions. It uses a block's NRU value to approximate its stack distance, and in turn, live distance. It cannot differentiate between the relative order of blocks in the same recency class. In general, Leeway can be implemented with any base technique which maintains (1) a partial relative order of blocks based on their relative reference time and (2) a monotonically non-decreasing order for a given block's position between re-references or until eviction.
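For concreteness, the sketch below shows one way the 2-bit NRU aging and victim selection described above could be organized; classes 1–4 are encoded as values 0–3. The insertion class, the deterministic (rather than random) tie-breaking, and the data layout are illustrative assumptions.

#include <cstdint>
#include <vector>

// 2-bit NRU for one cache set: value 0 corresponds to class 1 (most recently used)
// and value 3 to class 4 (eviction candidates).
struct NruSet {
    std::vector<uint8_t> recency;                      // one 2-bit value per way
    explicit NruSet(int ways) : recency(ways, 3) {}

    void on_hit(int way)  { recency[way] = 0; }        // promote to class 1
    void on_fill(int way) { recency[way] = 2; }        // insertion class (assumed)

    int pick_victim() {
        for (;;) {
            for (int w = 0; w < (int)recency.size(); ++w)
                if (recency[w] == 3) return w;         // evict a block from class 4
            for (auto &v : recency) ++v;               // no candidate: age every block
        }
    }
};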
Fig. 3.5 summarizes key elements of the design.
LDPT:
Each PC-indexed LDPT entry contains a stable-live-distance field that indicates the current live distance based on most recent history. Updates to stable-live-distance are controlled by VTTs and two additional LDPT fields: (1) variance-count is a counter for tracking the number of consecutively evicted cache lines whose live distance differs from the stored value, and (2) variance-direction is a bit indicating the direction of the change. Once the count matches the value of a VTT for a given direction, the value of stable-live-distance is updated. To avoid additional storage for transient live distance values, the new stable-live-distance value is taken from the evicted block that triggers the update.
VTTs:
To enable the Bypass- and Reuse-Oriented policies, Leeway uses a pair of Variability Tolerance Thresholds that control the rate at which stable-live-distance values are updated (Sec. 3.3.2). Empirically, we find that a 3-bit VTT is sufficient, and use the maximum value for the slow update (i.e., requiring 7 consecutive evictions with a live distance different, and in the same direction, from the stable-live-distance) and a value of 1 for the aggressive threshold. Thus, the two valid VTT configurations are either {7,1} (for the Bypass-Oriented policy, with a slow increase and fast decrease) or {1,7} (for the Reuse-Oriented policy, with a fast increase and slow decrease).
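The sketch below illustrates how the conditional stable-live-distance update governed by the VTTs might look. The field types, the resetting of the variance count when the observed live distance matches the stored value, and the function interface are simplifications of this sketch rather than a description of the exact hardware.

#include <cstdint>

// One LDPT entry (per policy). In hardware, stable-live-distance is 4 bits and
// variance-count is 3 bits; plain integers are used here for clarity.
struct LdptEntry {
    uint8_t stable_live_distance = 0;
    uint8_t variance_count = 0;
    bool    variance_up = false;        // variance-direction

    // vtt_up/vtt_down are {7,1} for the Bypass-Oriented policy and {1,7} for the
    // Reuse-Oriented policy.
    void on_eviction(uint8_t observed_live_distance, uint8_t vtt_up, uint8_t vtt_down) {
        if (observed_live_distance == stable_live_distance) {
            variance_count = 0;                           // behavior is consistent
            return;
        }
        bool up = observed_live_distance > stable_live_distance;
        if (up == variance_up) ++variance_count;          // same direction: accumulate
        else { variance_up = up; variance_count = 1; }    // direction changed: restart
        if (variance_count >= (up ? vtt_up : vtt_down)) {
            // Adopt the evicted block's live distance as the new stable value.
            stable_live_distance = observed_live_distance;
            variance_count = 0;
        }
    }
};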
LLC:
Leeway requires all LLC blocks to carry a field, predicted-live-distance, which is read from the LDPT at block allocation time and is subsequently used for dead block prediction. As this field is embedded in the cache, dead block prediction can be done locally in the cache just by comparing a block's LRU stack position with the value of its predicted-live-distance field. Meanwhile, the cache blocks from the sampler sets carry two additional fields: live-distance and hash-pc. These are used for learning, allowing evicted blocks to index the LDPT and, if necessary, update its fields as explained above.

On an LLC miss, the LDPT is indexed using a hash of the miss PC to recall the stable-live-distance, which is then transferred to the incoming block's predicted-live-distance field. If stable-live-distance is 0, the block is expected to have no reuse and is bypassed to the higher-level caches. Since bypassed blocks have no opportunity to retrain, Leeway inserts them into the sampler sets with a small probability (1% for the Bypass-Oriented Policy and 3% for the Reuse-Oriented Policy) to enhance learning.
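A minimal sketch of this fill-time flow follows, assuming the relevant LDPT entry has already been read. The random-number source, the way sampler-set insertion interacts with bypassing, and the function interface are simplifying assumptions.

#include <cstdint>
#include <random>

struct FillDecision { bool bypass; uint8_t predicted_live_distance; };

// Decide, on an LLC miss, whether to allocate the block or bypass it.
FillDecision on_llc_miss(uint8_t stable_live_distance, bool maps_to_sampler_set,
                         bool bypass_oriented_policy, std::mt19937 &rng) {
    if (stable_live_distance > 0)
        return {false, stable_live_distance};             // allocate with prediction

    // Predicted to see no reuse: normally bypass to the higher-level caches,
    // but occasionally insert into a sampler set so the LDPT can keep learning.
    if (maps_to_sampler_set) {
        std::uniform_real_distribution<double> coin(0.0, 1.0);
        double p = bypass_oriented_policy ? 0.01 : 0.03;   // 1% / 3% retraining rate
        if (coin(rng) < p)
            return {false, 0};
    }
    return {true, 0};
}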
On a hit to a sampler set, the block's live-distance field is updated if its current stack position is greater than the value of the live-distance field. Meanwhile, for all sets (sampler as well as follower sets), the block's predicted-live-distance is also updated if its current stack position is greater than the value of the predicted-live-distance field. Note that the predicted-live-distance field is never used to update the LDPT, and thus the change remains local and protects only the block for which the predicted-live-distance is increased.
To find a victim, Leeway searches for a dead block by comparing each block's LRU or NRU position to its predicted-live-distance field. If more than one block is found dead, the block with the minimum predicted-live-distance value is picked for replacement. If no block is found dead, the LRU block is evicted. If the evicted block resides in the sampler set (dead or not), its live-distance and hash-pc fields are forwarded to the LDPT for a potential update.
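The victim-selection rule just described can be summarized by the following sketch; the data layout and the stack-position encoding are illustrative assumptions.

#include <cstdint>
#include <vector>

struct BlockState { uint8_t stack_position; uint8_t predicted_live_distance; };

// Return the way to evict: prefer the dead block with the smallest predicted
// live distance, otherwise fall back to the LRU block.
int select_victim(const std::vector<BlockState> &set) {
    int victim = -1, lru = 0;
    uint8_t min_pld = 0xFF, max_pos = 0;
    for (int w = 0; w < (int)set.size(); ++w) {
        bool dead = set[w].stack_position > set[w].predicted_live_distance;
        if (dead && set[w].predicted_live_distance < min_pld) {
            min_pld = set[w].predicted_live_distance;
            victim = w;
        }
        if (set[w].stack_position > max_pos) {             // track the LRU block
            max_pos = set[w].stack_position;
            lru = w;
        }
    }
    return victim >= 0 ? victim : lru;
}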
To dynamically choose between the Bypass- and Reuse-Oriented policies, Leeway relies on a set-dueling mechanism [86]. Thus, two separate groups of sampler sets are used, with each group implementing one of the two policies. To support simultaneous implementation of policies, the LDPT must be extended to support two sets of {stable-live-distance, variance-count, variance-direction} fields per entry. While the sampler sets always access their dedicated fields based on a static mapping, the rest of the sets read the stable-live-distance from the winning policy.

To determine the winning policy, Leeway maintains two saturating miss counters, one for each policy. The counters are incremented on a miss to a sampler set of the respective policy. Periodically, the miss counters are sampled and the winning policy is selected based on the counter with the lowest value.

Often, the winning policy remains the same throughout the application's execution. In some cases, however, the winning policy may change due to changes in the application's phase or its co-runner(s). In theory, a policy change requires reloading predicted-live-distance for all cache blocks using the stable-live-distance of the new winning policy in the LDPT. In practice, we find that policy change is infrequent, indicating that the simplest way to deal with it is to leave existing blocks untouched, potentially incurring a handful of poor decisions but minimizing microarchitectural complexity.
Storage cost:
We analyze storage requirements for a 16-way 2MB LLC with 64B blocks. We find that a 16K-entry LDPT per core is sufficient and is not affected by destructive aliasing, thus affording a tagless design. For LRU-based Leeway, each LDPT entry of each of the two Leeway policies has 8 bits: 4 for stable-live-distance, 3 for variance-count and 1 for variance-direction. The resulting cost of the LDPT is thus 32KB. We use a 64-set sampler per policy. Each block in the sampler carries a 4-bit live-distance and a 14-bit hash-pc field, requiring 4.5KB of storage in total. All cache blocks, including the sampler, include a 4-bit predicted-live-distance, totaling 16KB of storage. The total storage of Leeway is thus 68.5KB (52.5KB overhead + 16KB of LRU state), or 2.3% of the LLC storage. Using 2-bit NRU instead of LRU further reduces the storage by 36% to 44KB, or 1.4% of the LLC storage, by lowering live distance storage costs from 4 to 2 bits.

Technique     | Recency State (KB) | Predictor State Within LLC (KB) | Predictor State External to LLC (KB) | Total (KB) | When is History Table accessed?
SDBP [73]     | 16                 | 4                               | 18.75                                | 38.75      | Hits + Misses
SHiP [67]     | 8                  | 3.75                            | 6                                    | 17.75      | Misses*
Hawkeye [37]  | 12                 | -                               | 19                                   | 31         | Hits + Misses
Leeway-LRU    | 16                 | 20.5                            | 32                                   | 68.5       | Misses
Leeway-NRU    | 8                  | 12                              | 24                                   | 44         | Misses

Table 3.2: Storage cost (excluding tag and data) for a 16-way 2MB LLC, 128 sampler sets, and a 16K-entry Predictor Table for history-based predictive techniques. (*For SHiP, cache hits to the follower sets do not access the history table. Meanwhile, cache hits to the sampler sets do update the history table; however, the updates to the table can be pipelined and taken off the critical path.)

Table 3.2 compares the storage requirements of Leeway to those of prior techniques. SHiP [67], an insertion technique, has the lowest storage cost at the expense of not predicting blocks that are reused. Among dead block predictors that also predict reused blocks, the preferred Leeway-NRU configuration requires 44KB of storage in total (including NRU bits), compared to 38.75KB for SDBP [73] and 31KB for Hawkeye [37], considering the same number of sampler sets and predictor table entries for all techniques. While Leeway is slightly more expensive, we observe that the storage requirements for all techniques are in a similar range of several tens of KBs. Such modest storage requirements are dwarfed by the size of the LLC.
Complexity:
Operations performed by Leeway at various stages are limited to simple additions and comparisons, which are quite hardware friendly. Additionally, Leeway embeds the metadata necessary for the prediction (i.e., live distance) with the cache blocks. As a result, LLC hits and replacement decisions never access remote metadata. The only time Leeway accesses its prediction table (LDPT) is upon cache misses, when stable-live-distance is read and possibly updated. These accesses are entirely off the critical path, since they do not involve state updates to a live cache block.

In contrast, state-of-the-art predictive techniques, such as SDBP [73] and Hawkeye [37], use a PC-indexed prediction table that is probed on every LLC access (including a cache hit) to inform the block's eviction priority. For example, Hawkeye incurs 2.3× the accesses to its prediction table when compared to Leeway (SPEC average). Such frequent accesses to the prediction table are particularly undesirable in a modern multi-core processor with a NUCA LLC, as each LLC hit requires state-of-the-art predictive techniques to access the PC-indexed prediction table located elsewhere on a chip, incurring latency, energy, and traffic overheads due to the need to traverse the on-chip network.
Leeway can naturally be extended to multi-core deployments. The only notable difference is in determining the winning policy for each individual core. When extended to multi-core, the sampler sets for a given core, referred to as the owner core, are shared with other follower cores that will use them as followers of their respective (and potentially different) policies. Thus, the cache policy for each core seeks to minimize the total misses across all applications. Note that a core may select a policy which may not work best for its own application but reduces overall misses.
Microarchitectural extensions:
For a multi-core setup, the LDPT is implemented as a per-core private structure. Thus, when a core initiates a memory instruction, the LDPT that is private to the core is accessed using the PC of the memory instruction. As with the single-core implementation, Leeway requires two saturating counters per core (one each for the Bypass- and Reuse-Oriented policies) for tracking aggregate misses in a sampling interval.
We evaluate the performance of SPEC CPU 2006 applications using a modified version of CMP$im [79] provided with the JILP Cache Replacement Championship [68]. Table 3.3 summarizes the features of the simulated processor. For each SPEC application, we use
SimPoint [95] to identify up to six simpoints of one billion instructions, each representing a different phase of an application. We use the SimPoint tool to generate the weights for each simpoint that are then used to calculate the overall performance. Each program is run with the first ref input provided by the runspec command. For each run, the simpoint is used to warm microarchitectural structures for 200M instructions, then it measures and reports the result for the subsequent one billion instructions. The result reported for each benchmark is the weighted average of the results for the individual simpoints.

Core Model  | OoO: 4-wide pipeline, 128-entry ROB
L1 Caches   | Private, Split, 8-way, 32KB
L2 Cache    | Private, Unified, 8-way, 256KB
L3 Cache    | Shared, Unified, 16-way, 2MB per core, Non-Inclusive Non-Exclusive
Memory      | 200-cycle access latency

Table 3.3: System parameters for simulations.

For multi-core applications, we use 100 multi-programmed mixes, with each individual application of a mix randomly selected from 23 (of 29) SPEC applications whose performance is sensitive to cache replacement decisions. For each application in the mix, we use the highest weighted simpoint. Each mix is run on a quad-core system for 1 billion instructions following a warmup of 200 million instructions. Applications which finish before others are restarted to maintain the cache pressure until the slowest one has finished. We report the weighted speed-up over LRU. To compute it, we run every application in isolation with an 8MB LLC under LRU to calculate SingleIPC_i. We then calculate Weighted IPC as ∑_{i=1}^{N} (IPC_i / SingleIPC_i), where IPC_i is the application's IPC in the presence of co-runners.

RRIP [71] is the state-of-the-art lightweight dynamic technique that does not depend on history-based learning. We implement RRIP based on the source code from the cache replacement championship [68] for RRIP.
Sampling Dead Block Predictor (SDBP) [73] is a dead block predictor that correlates the "last touch" to a block with the PC of the memory instruction making the touch. We use the source code from the cache replacement championship [68] for SDBP. We use the default settings provided for SPEC workloads except for increasing the number of sampler sets from 32 to 128.
Signature-based Hit Predictor (SHiP) [67] is an insertion policy which builds on RRIP [71]. It learns and records whether a block is re-referenced after insertion and uses this information to guide insertion placement. We implement SHiP with 2-bit RRIP as the baseline technique and a 14-bit PC signature. Each predictor table entry contains a 3-bit saturating counter which is updated by the 128 sampled sets.
Hawkeye [37] learns a block's behavior by simulating Belady's optimal algorithm [114] and trains a predictor that, on each cache access, updates the block's eviction priority. The authors kindly provided the source code of their technique, which we use for the evaluation.
Leeway:
For learning, Leeway uses 64 sets per core for each policy. Leeway uses set-dueling to find the preferred policy (Sec. 3.3.4.3). Miss counters are sampled every 200M instructions or 100K cache accesses in the sampler sets, whichever occurs first. The LDPT has 16K entries per core. Finally, for the configurations that enable data prefetchers in the higher-level caches, Leeway always uses the Bypass-Oriented Policy for the cache blocks inserted by prefetch requests. Leeway implementations are referred to as Leeway-LRU or Dynamic Leeway-LRU for LRU-based implementations and Leeway-NRU or Dynamic Leeway-NRU for NRU-based implementations. Leeway-NRU uses 2-bit NRU as the base technique, unless specified otherwise.
In this section, we evaluate Leeway and state-of-the-art cache management techniques on four different machine configurations – single-core with data prefetchers off, single-core with data prefetchers on, quad-core with data prefetchers off and quad-core with data prefetchers on. We first provide average speed-ups for all techniques for each configuration. Next, we analyze performance for both quad-core configurations in Sec. 3.5.1, followed by a detailed analysis for a single-core configuration in Sec. 3.5.2.

Figure 3.6: Average speed-up for SPEC applications on four machine configurations.

Fig. 3.6 shows the average speed-up for SPEC applications on all four deployment configurations. For each configuration, the speed-up is reported over the baseline implementing an LRU-managed cache on the same configuration. While we discuss below the speed-up for different techniques on each configuration, it is worth noting that the baseline configurations with data prefetchers by themselves outperform the respective configurations without the data prefetchers for LRU, by 39.1% for single-core and 33.0% for multi-core, which is not shown in this figure.

When data prefetchers are off, both Leeway implementations achieve good performance for both single-core and quad-core configurations. On a single-core configuration, Leeway-LRU and Leeway-NRU both yield an average speed-up of 6.5% over LRU vs 3.9% for RRIP, 4.3% for SDBP, 4.5% for SHiP and 6.4% for Hawkeye. On a quad-core configuration, Leeway-LRU and Leeway-NRU yield an average speed-up of 7.5% and 8.0%, respectively, vs 4.0% for RRIP, 6.9% for SDBP, 8.0% for SHiP and 9.7% for Hawkeye.

When the data prefetchers in the higher-level caches are on, average speed-ups for prior techniques drop significantly whereas both Leeway implementations continue to achieve good performance. On a single-core configuration, Leeway-LRU and Leeway-NRU yield an average speed-up of 4.5% and 4.8%, respectively, vs 1.9% for RRIP, 1.0% for SDBP, 2.1% for SHiP and 1.7% for Hawkeye. Similarly, on a quad-core configuration, Leeway-LRU and Leeway-NRU outperform prior techniques with an average speed-up of 7.7% and 7.8% over LRU, respectively, vs 2.7% for RRIP, 4.1% for SDBP, 4.8% for SHiP and 0.8% for Hawkeye. Note that Hawkeye, which provides the highest average performance among prior techniques in the absence of data prefetchers, is among the least effective techniques in the presence of data prefetchers.

A quad-core configuration with data prefetchers is the most representative of a real-world deployment scenario. The performance trend on this configuration shows that history-based predictive techniques (except for Hawkeye) outperform RRIP (the state-of-the-art lightweight dynamic technique) and LRU (a recency-friendly static technique), corroborating prior works [67, 73]. Surprisingly, Hawkeye provides the least performance improvement, which is a new result as the prior work evaluated Hawkeye in the absence of data prefetchers [37].
In this section, we evaluate the effectiveness of Leeway-NRU and three state-of-the-art history-based predictive techniques (SDBP, SHiP and Hawkeye) for both quad-core configurations. We omit the results for RRIP and Leeway-LRU from the subsequent studies for brevity.
Figure 3.7: Weighted speed-up for multi-programmed SPEC mixes when prefetchers are off (average speed-ups: SDBP 6.9%, SHiP 8.0%, Hawkeye 9.7%, Leeway-NRU 8.0%). The speed-ups for mixes are sorted for each technique individually.
Figure 3.8: Weighted speed-up on multi-programmed SPEC mixes when prefetchers are on (average speed-ups: SDBP 4.1%, SHiP 4.8%, Hawkeye 0.8%, Leeway-NRU 7.8%). The speed-ups for mixes are sorted for each technique individually.
In the absence of prefetchers, all techniques provide a similar average speed-up, with SDBP providing the lowest (6.9%) and Hawkeye providing the highest (9.7%) average speed-up, as shown in Fig. 3.7. Hawkeye's effectiveness can be attributed to its learning mechanism. Like other techniques, Hawkeye also relies on a PC-based reuse correlation. However, unlike other techniques, Hawkeye's learning mechanism simulates optimal replacement on past LLC accesses, and thus provides more accurate reuse predictions.
In the presence of prefetchers, variability in the reuse behavior of cache blocks increases as prefetchers speculatively load cache blocks into the higher-level caches, some of which are bound to be inaccurate, leading to extra LLC accesses that would not have occurred in the absence of prefetchers. As shown in Fig. 3.8, Leeway-NRU is the most effective in tolerating prefetcher-induced variability, yielding an average speed-up of 7.8% over LRU. In comparison, SDBP and SHiP yield an average speed-up of 4.1% and 4.8%, respectively. Hawkeye provides the least performance with an average speed-up of 0.8%, in stark contrast to its performance without the prefetchers.

When compared to the prior techniques, Leeway-NRU achieves an average speed-up of 3.5% over SDBP, 2.9% over SHiP, 6.9% over Hawkeye and 7.8% over LRU. Of the 100 evaluated mixes, on 78 mixes Leeway-NRU provides higher performance than any of the prior techniques, while outperforming SDBP on 85 mixes, SHiP on 79 mixes and Hawkeye on 93 mixes.

Figure 3.9: Evaluation of various cache management techniques for the High Opportunity SPEC CPU 2006 applications: (a) Miss Reduction over LRU; (b) Speed-up over LRU. Names of some applications are shortened as follows: cactus for cactusADM, sphinx for sphinx3 and xalan for xalancbmk.
In this section, we provide a detailed performance analysis of various techniques for a single-core configuration with data prefetchers off, as this configuration has the minimum noise in access patterns. In other configurations, the reuse behavior of cache blocks is significantly affected by prefetchers or cache pressure from the co-located workloads sharing LLC.

To better understand the effects of all cache management techniques, we classify SPEC applications into three categories: (1) High Opportunity, if performance improves by at least 10% over LRU with any one technique; (2) No Opportunity, if performance doesn't vary by more than 0.5% for all techniques; (3) Mix Opportunity for the rest.
High opportunity applications:
Fig. 3.9(a) shows the reduction in LLC misses andFig. 3.9(b) shows the improvement in performance compared to the baseline LRUfor the high opportunity applications. Overall all techniques are highly effective onthese applications with Leeway-NRU reducing the most misses on average (28.9% overLRU), vs 23.2% for SDBP, 23.9% for SHiP and 26.5% for Hawkeye. The performance Chapter 3. Leeway – Domain-Agnostic Cache Management -367-25 -169-22 -4862545353 -100102030 pe r l b z i p g cc b w a v e s m il c z eu s m pg r o m a cs l e s li e dea l c a l c u li x h m m e r ge m s f li bq h264 r e f t on t o o m ne t pp w r f g m ean M i ss R edu c t i on ( % ) SDBP SHiP Hawkeye Leeway-NRU (a) Miss Reduction over LRU -21-10 -9 -50510 pe r l b z i p g cc b w a v e s m il c z eu s m pg r o m a cs l e s li e dea l c a l c u li x h m m e r ge m s f li bq h264 r e f t on t o o m ne t pp w r f g m ean S peed - up ( % ) SDBP SHiP Hawkeye Leeway-NRU (b) Speed-up over LRU
Figure 3.10: Evaluation of various cache management techniques for the Mix Opportunity SPEC CPU 2006 applications. Names of some applications are shortened as follows: perl for perlbench, bzip for bzip2, leslie for leslie3d, deal for dealII, gemsf for GemsFDTD and libq for libquantum.
Mix opportunity applications:
Fig. 3.10(a) shows the reduction in LLC misses and Fig. 3.10(b) shows the improvement in performance compared to the baseline LRU for the mix opportunity applications. Overall, Hawkeye and Leeway-NRU are far more effective than SDBP and SHiP on the mix opportunity applications, with an average miss reduction of 12.0% for Hawkeye and 9.5% for Leeway-NRU vs only 2.0% for SDBP and 3.4% for SHiP. For four applications (zeusmp, calculix, tonto and omnetpp), at least one of the techniques incurs more misses than the baseline LRU. For two of these applications, Leeway-NRU also increases misses, but the increase is relatively small. For example, on zeusmp, Leeway-NRU increases misses by 3.7% vs 25.5% for SHiP. Similarly, on calculix, Leeway-NRU increases misses by 47.7% vs 366.6% for SDBP and 168.7% for SHiP. On tonto and omnetpp, SDBP and SHiP increase misses (1%-9%) whereas Leeway-NRU manages to reduce misses (4%-5%) over LRU. The performance of all techniques generally correlates well with the miss reduction, with Hawkeye and Leeway-NRU achieving average speed-ups of 3.0% and 2.1%, respectively, vs 0.9% for SDBP and 0.7% for SHiP. Leeway-NRU slows down the fewest applications, zeusmp and calculix, with a maximum slowdown of 3.6%. In comparison, SDBP slows down 3 applications (max slowdown of 20.6%), SHiP slows down 5 applications (max slowdown of 10.3%) and Hawkeye slows down 3 applications (max slowdown of 1.7%).
Figure 3.11: Evaluation of various cache management techniques for the No Opportunity SPEC CPU 2006 applications.
No opportunity applications:
Fig. 3.11(a) shows the reduction in LLC misses and Fig. 3.11(b) shows the improvement in performance compared to the baseline LRU for the no opportunity applications. The average miss reduction for all techniques ranges between 1% and 6%. However, the performance of these applications is not sensitive to replacement decisions, and the change in performance due to any technique is at most 0.5% over LRU.
Hawkeye's learning mechanism simulates optimal replacement (OPT) on past LLC accesses, unlike Leeway (as well as SDBP and SHiP), which relies on baseline LRU or NRU for learning. Thus, Hawkeye, in theory, can provide more accurate reuse predictions. For example, between two cache blocks, each having a reuse distance greater than the associativity, OPT can accurately identify the cache block with the smaller reuse distance, in contrast to LRU-like techniques. Thus, Hawkeye is more likely than Leeway to retain a cache block with a smaller reuse distance in the presence of thrashing.
Single-core Configuration | Coverage | Accuracy
Hawkeye | 80.3% | 78.4%
Leeway-NRU | 82.8% | 72.3%
Table 3.4: Prediction coverage and accuracy, averaged across SPEC applications (excluding the no opportunity applications) on a single-core configuration in the absence of data prefetchers.
To quantitatively support this hypothesis, we study prediction coverage and accuracy for Hawkeye and Leeway-NRU. Coverage is measured as the percentage of total evictions that are predicted dead by a cache management technique. Accuracy is measured as the percentage of predicted evictions that are correct. (Note that both are self-normalized metrics; if the total evictions under two techniques are significantly different for the same application, analyzing coverage and accuracy in isolation may lead to wrong conclusions.) Table 3.4 shows prediction coverage and accuracy for Hawkeye and Leeway-NRU, averaged across SPEC applications (excluding the no opportunity applications). Hawkeye's prediction coverage is nearly the same as Leeway-NRU's. However, Hawkeye has a higher prediction accuracy (78.4% vs 72.3% for Leeway-NRU), thanks to the OPT-based learning. In the presence of data prefetching, however, the effectiveness of Hawkeye reduces significantly. Amidst the prefetcher-induced variability, Hawkeye takes a conservative approach and makes far fewer predictions, reducing the opportunity to evict dead blocks. Prediction coverage for Hawkeye averages 71.2% (vs 80.3% without prefetching) and accuracy also drops to 74.3% (vs 78.4% without prefetching), explaining Hawkeye's poor performance in the presence of data prefetchers.
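As a concrete reading of these two metrics, the following sketch computes coverage and accuracy from simple per-application eviction counters; the struct and counter names are illustrative assumptions rather than code from the simulation infrastructure used in this thesis.

#include <cstdint>

// Counters a cache model might accumulate while simulating one application.
struct EvictionStats {
  uint64_t total_evictions = 0;      // all LLC evictions
  uint64_t predicted_evictions = 0;  // evictions of blocks that were predicted dead
  uint64_t correct_predictions = 0;  // predicted-dead blocks that saw no further reuse
};

// Coverage: percentage of all evictions that the predictor identified as dead.
inline double coverage_pct(const EvictionStats& s) {
  return s.total_evictions ? 100.0 * s.predicted_evictions / s.total_evictions : 0.0;
}

// Accuracy: percentage of predicted evictions that turned out to be correct.
inline double accuracy_pct(const EvictionStats& s) {
  return s.predicted_evictions ? 100.0 * s.correct_predictions / s.predicted_evictions : 0.0;
}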
Reuse-aware update policies:
To understand the effect of Leeway's policy choice, we compare the performance of the individual static policies (Bypass- and Reuse-Oriented) with an adaptive scheme (Dynamic Leeway, or simply Leeway) that dynamically chooses one of the static policies at runtime (Sec. 3.3.2). Dynamic Leeway was used throughout the evaluation. Fig. 3.12 presents the results for SPEC applications on a single-core configuration without data prefetchers. No opportunity applications are not shown for clarity.
Figure 3.12: Evaluation of various Leeway-NRU configurations (all using 2-bit NRU as the base policy). (a) Speed-up over LRU for High Opportunity applications. (b) Speed-up over LRU for Mix Opportunity applications.
Applications benefiting more from the Bypass-Oriented Policy (BOP) are shown in Fig. 3.12. Such applications include four of the six high opportunity applications (left group of Fig. 3.12(a)) and several mix opportunity ones (left group of Fig. 3.12(b)). For these applications, the access pattern is dominated by bypassable blocks. For example, for these applications, on average, only 7.7% (max 26.3% for deal) of blocks inserted in the cache incur at least one hit under the OPT replacement policy. The Reuse-Oriented Policy conservatively increases the live distance in the face of variability. Predicting a high live distance for such blocks only contributes to increasing dead time, which, in turn, lowers cache efficiency. The right side of Fig. 3.12(a) and Fig. 3.12(b) respectively shows two high opportunity applications and several mix opportunity applications that benefit more from the Reuse-Oriented Policy (ROP). For most of these applications, none of the techniques are very effective. The culprit is a high incidence of blocks with reuse and inter-generational variability. For example, for these applications, on average, 33.9% (max 74.7% for tonto) of blocks inserted in the cache incur at least one hit under the OPT replacement policy. In the case of Leeway, the Reuse-Oriented Policy generally proves beneficial by steering the live distance toward the recently-observed maximum in order to boost the opportunity for reuse. For instance, this proves particularly beneficial on omnetpp, on which Leeway-NRU is the only technique to avoid a slowdown (see Fig. 3.10(b)). To understand how BOP and ROP make predictions, we compare the coverage and accuracy of both Static BOP and Static ROP policies.
Figure 3.13: Prediction coverage and accuracy for Leeway-NRU static policies.

Fig. 3.13 shows prediction coverage and accuracy, averaged across all SPEC applications (excluding the no opportunity ones). The figure also shows data for mcf, which prefers BOP, and calculix, which prefers ROP, as representative examples. On mcf, ROP reduces coverage to 86.5% from 99.5% for BOP. However, that only marginally increases accuracy, to 96.1% from 95.5% for BOP. The end result is a loss of opportunity for ROP in making predictions (indicated by low coverage), which hurts performance. On calculix, ROP reduces coverage to 92.4% from 97.9% for BOP. However, that significantly increases accuracy, to 64.5% from 46.7% for BOP, providing higher performance for ROP. The results show that BOP, in general, trades coverage for accuracy, which is beneficial for applications that are dominated by bypassable blocks, as the likelihood of making a wrong prediction is already low to begin with. In contrast, ROP trades accuracy for coverage, which is beneficial for applications that exhibit a significant amount of inter-generational variability. Finally, we show that at runtime, Dynamic Leeway generally selects the policy that is best suited to a given application. Recall Fig. 3.12, which shows that Dynamic Leeway effectively selects between the two static policies, ROP and BOP, for all applications, with Dynamic Leeway matching the performance of the best performing static policy. Moreover, Leeway can adapt to phase behavior within a single application, as demonstrated on three applications (mcf, hmmer and xalan) that have distinct cache behavior across phases. On these applications, Dynamic Leeway outperforms the best static policy by over 2%.

Reuse-unaware static Leeway:
To isolate the performance due to dead block predictions using live distance as a metric from the reuse-aware dynamic update policies, we evaluate Static Leeway-NRU. Static Leeway-NRU employs a static VTT value of 7 in both directions, and thus does not require set-dueling for policy selection, requiring only 32KB of total storage (vs 44KB for Dynamic Leeway-NRU). Fig. 3.12 shows the performance of Static Leeway on SPEC applications.
Figure 3.14: Average speed-up for various Leeway configurations on single-core and quad-core configurations, with and without data prefetchers.
Overall, Static Leeway provides an average speed-up of 5.3%; however, due to its reuse-unaware design, it underperforms dynamic Leeway-NRU (6.5%) for almost all applications, thus justifying the additional storage cost in the LDPT for the Dynamic Leeway design.
In this section, we evaluate the sensitivity of Leeway-NRU's performance to the number of bits used by the baseline NRU technique. So far, we have used Leeway-NRU with 2 bits per cache block. Fig. 3.14 shows the average speed-up for Leeway-NRU (1-4 bits per cache block) for all four configurations. The figure also shows the performance of Leeway-LRU as a reference. Overall, Leeway-NRU (2b), which was used throughout the evaluation, consistently provides good performance across the configurations. Leeway-LRU uses LRU as the baseline technique, which maintains precise recency state for the cache blocks in a set. However, this is largely beneficial only for applications that benefit more from the Reuse-Oriented Policy. For example, on a single-core configuration in the absence of data prefetchers, across applications that benefit more from ROP, Leeway-LRU provides a 3.2% average speed-up vs 2.8% for the best performing Leeway-NRU. Meanwhile, for applications that benefit more from BOP, Leeway-LRU achieves an average speed-up of 10.1% vs 10.9% for the best performing Leeway-NRU. As explained in Sec. 3.5.4, applications that benefit more from BOP are dominated by bypassable blocks. For these applications, maintaining precise recency state is not required (and is sometimes counterproductive) as the live distance of most blocks is zero.
As explained in Sec. 3.3.5, prior techniques such as Hawkeye access their history tables on every cache access, increasing on-chip traffic. Table 3.5 compares the number of table look-ups across techniques. Overall, SDBP and Hawkeye require 2.3 and 2.5 times the table look-ups of Leeway-NRU, respectively.
Technique | SDBP | SHiP | Hawkeye | Leeway-NRU
Table Look-ups | 2.3x | | 2.5x | 1x

Table 3.5: History table look-ups, normalized to Leeway-NRU, averaged over SPEC applications (excluding no opportunity ones).

Also note that almost half of the look-ups for SDBP and Hawkeye occur during cache hits, and thus are on the critical path. In contrast, Leeway not only requires significantly fewer table look-ups, but also performs all of these look-ups only during cache misses, which are off the critical path.
Leeway trades more storage for fewer table look-ups by embedding prediction metadata in the cache, because of which Leeway-NRU requires slightly more storage than the prior techniques, as shown in Sec. 3.3.5. While the storage requirement is relatively small in comparison to the LLC, there is some room for reducing Leeway's storage by changing the Leeway-NRU configuration as follows: (1) By reducing the NRU bits from 2 to 1, the storage requirement for Leeway-NRU drops from 44KB to 32KB. (2) Another option is to use the Static (i.e., reuse-unaware) version of Leeway-NRU, which halves the storage requirement of the LDPT, reducing the total storage from 44KB to 32KB. However, those configurations also lead to lower performance, as shown in Table 3.6.
In this section, we provide an evaluation summary of concurrent techniques submitted to the Cache Replacement Championship (CRC2) [20]. Each competing technique was allowed to utilize a maximum of 32KB of storage.
Leeway-NRU Implementation | Metadata Storage | Avg. Speed-up (Core: Quad, Prefetch: On)
Dynamic Leeway-NRU (2-bit) | 44KB | 7.8%
Dynamic Leeway-NRU (1-bit) | 32KB | 6.9%
Static Leeway-NRU (2-bit) | 32KB | 7.0%
Table 3.6: Per-core storage cost (assuming 2MB of LLC) for different Leeway-NRU implementations. Fig. 3.14 shows the performance of the other configurations.
Figure 3.15: Average speed-up for three benchmark suites – single-core and quad-core multi-programmed SPEC applications, and quad-core Cloudsuite applications.

We evaluate the top five ranked techniques – LIME [25], MPP [19], RED [13], SHiP++ [27] and Hawkeye++ [17] – from the fifteen techniques that competed in CRC2. SHiP++ and Hawkeye++ are improved, prefetch-aware implementations of SHiP [67] and Hawkeye [37], respectively. We evaluate the techniques using the methodology used in CRC2, which is very similar to the methodology used in the evaluation so far (Sec. 3.4) except for two major differences: (1) CRC2 uses the ChampSim [12] cycle-accurate simulator instead of CMP$im [79]. (2) CRC2 evaluates five Cloudsuite [58] applications (media streaming, web search, software testing, data serving, map reduce) as a representative benchmark suite for server applications executing in data-centers, in addition to single-core SPEC CPU 2006 applications and quad-core multi-programmed SPEC applications. We note that the implementation of Leeway-NRU submitted to CRC2 had a bug, because of which live distance values were not read correctly from the LDPT. We use the updated version from https://github.com/faldupriyank/leeway in this evaluation. This implementation is identical to the Dynamic Leeway-NRU implementation evaluated so far, except that we reduced the number of LDPT entries to bring the total storage under 32KB. Fig. 3.15 shows the average speed-up over LRU for all three application benchmark suites, with data prefetchers kept on for all simulations. On single-core SPEC application benchmarks, Leeway provides an average speed-up of 1.7% vs 1.8% for the best performing techniques (MPP, SHiP++ and Hawkeye++). Leeway achieves a higher average speed-up than LIME (1.3%) and RED (1.2%). On quad-core SPEC benchmarks, Leeway yields an average speed-up of 5.1% over LRU vs 6.4% for Hawkeye++, the best performing technique. Leeway achieves a higher average speed-up than LIME (4.2%) and MPP (3.8%). Finally, on Cloudsuite applications, Leeway achieves an average speed-up of 1.7% vs 1.9% for MPP, the best performing technique. Leeway achieves a higher average speed-up than all but MPP, while Hawkeye++ achieves the lowest average speed-up (0.9%). The results show that Leeway consistently provides good performance across the benchmark suites, on par with the best performing concurrent techniques. While all these techniques utilize the same 32KB of storage for predictions, note that the results do not factor in hardware complexity. For example, Hawkeye++, the winner of CRC2, has a fundamentally similar design to Hawkeye, and thus requires history table look-ups on every cache access. Recall from Table 3.5 that Hawkeye requires a significantly higher number of history table look-ups than Leeway-NRU. Moreover, about half of the history table look-ups for Hawkeye occur on cache hits and thus are on the critical path. In contrast, Leeway look-ups occur exclusively on cache misses, and thus are off the critical path, making Leeway more attractive from an implementation point of view.
Duong et al. introduced a DBP based on the notion of Protected Distance (PD) [59]. PD leverages reuse distance, an indirect metric that counts non-unique references to a set. A single PD is used for an entire application. If a block is not referenced beyond the application's PD, it is predicted dead. While conceptually PD sounds similar to Leeway, Leeway has two key advantages over PD. First, PD maintains a single Protected Distance for an entire application, whereas Leeway maintains a Live Distance per PC that is continuously trained throughout the application's execution. This maximizes Leeway's adaptivity while minimizing the dead time of blocks prior to prediction. Second, Live Distance relies on stack distance, and thus naturally "filters" non-unique references to the set. In contrast, PD counts all references to the set, which can inflate PD values and lead to increased dead time for cache blocks. Indeed, our evaluation of PD shows that it is generally inferior to both Leeway and other recent cache management techniques. On SPEC, the average performance improvement for Leeway-NRU is 6.5% versus 4.4% for PD on a single-core configuration without data prefetchers, and 4.8% versus 1.1% (in favor of Leeway-NRU) with the prefetchers. Others have also suggested using stack distance or reuse distance for cache replacement or modeling [31, 43, 56, 59, 85, 93]. Doing so requires maintaining a Reuse Distance Distribution (RDD) for an application, which itself can be storage intensive as it involves keeping a separate counter for each reuse distance maintained. Further, turning this RDD into a useful metric is challenging and computationally intensive. For example, [59] proposes dedicated compute logic while [31] relies on a software framework that runs on a core. In contrast, Leeway monitors the readily-available stack position within a set, which is already maintained by the base replacement policy. Deriving a block's live distance is then as simple as taking the max of the observed stack positions upon hits during its lifetime (a minimal sketch of this bookkeeping follows below). Thus, live distance fundamentally enables a very efficient hardware implementation within this general class of metrics. Teran et al. [40] proposed a perceptron-learning-based predictor for the LLC. Instead of correlating cache block behavior with just a single feature like the load PC, the predictor combines multiple features to predict a block's reuse behavior. To do so, the predictor maintains a separate predictor table for each feature, for a total of six tables. Each of these predictor tables needs to be accessed on every cache access (including hits), which makes this design difficult to scale for multi-core processors, as explained in Sec. 3.3.5.
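As referenced above, here is a minimal sketch of the live-distance bookkeeping, assuming illustrative field widths and function names rather than the exact Leeway hardware.

#include <algorithm>
#include <cstdint>

// Per-block metadata embedded alongside the cache block (illustrative widths).
struct BlockMeta {
  uint8_t live_distance = 0;  // max stack (recency) position observed on a hit
  uint16_t signature = 0;     // e.g., hashed PC of the load that filled the block
};

// Called on a cache hit; stack_position (0 = MRU) is already tracked by the
// base replacement policy, so no history table access is needed here.
inline void on_hit(BlockMeta& blk, uint8_t stack_position) {
  blk.live_distance = std::max(blk.live_distance, stack_position);
}

// Called on eviction (off the critical path): the accumulated live distance
// would train the per-PC predictor entry selected by blk.signature. A plain
// overwrite is shown; Leeway's actual update policies are more nuanced.
inline void on_evict(const BlockMeta& blk, uint8_t& ldpt_entry) {
  ldpt_entry = blk.live_distance;
}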
Contrary to the traditional recency stack, Pseudo-LIFO [76] manages the LLC as a fill stack. The approach dynamically learns the preferred eviction positions within the fill stack, and prioritizes the blocks close to the top of the stack for eviction. It learns the preferred positions for an application based on the combined behavior of all cache blocks, lacking the fine-grain adaptation that state-of-the-art approaches, including Leeway, use. We primarily used DBPs for efficient cache management of the LLC. Prior works have proposed DBPs for other use cases. Lai et al. used dead block prediction to optimize coherence protocols [104]. They proposed predicting the last access to a cache block on a core and self-invalidating the block after its last access; consequently, a subsequent access to the same cache block from another core does not incur the invalidation latency, improving performance for applications dominated by coherence communication. Lai et al. also used DBPs at the L1-D and used dead blocks as prefetch targets, obviating the need for auxiliary prefetch buffers [102]. Prior works have explored using dead block prediction to dynamically turn off cache blocks at the LLC that are predicted dead, to reduce leakage power [90, 94, 101]. Khan et al. used dead block prediction to implement a virtual victim cache [72]. They used dead blocks to hold blocks evicted from other sets, thus forming a pool of dead blocks as a virtual victim cache.
In this chapter, we showed that variability in the reuse behavior of cache blocks limits state-of-the-art history-based predictive techniques in achieving high performance. In response, we argued for variability-tolerant mechanisms and policies for cache management. As a step in that direction, we proposed Leeway, a history-based predictive technique employing two variability-tolerant features. First, Leeway introduces a new metric, Live Distance, that captures the largest interval of temporal reuse for a cache block, providing a conservative estimate of a cache block's useful lifetime. Second, Leeway implements a robust prediction mechanism that identifies dead blocks based on their past Live Distance values. To maximize cache efficiency in the face of variability, Leeway monitors the change in Live Distance values at runtime using its reuse-aware policies to adapt to the observed access patterns. Meanwhile, Leeway embeds prediction metadata with cache blocks in order to avoid critical-path history table look-ups on cache hits and to reduce on-chip network traffic, in contrast to the state-of-the-art techniques that access a history table on every cache access (including cache hits). On a variety of applications and deployment scenarios, Leeway consistently provides good performance that generally matches or exceeds that of state-of-the-art techniques.

Chapter 4: A Case for Domain-Specialized Cache Management
In the previous chapter, we showed that history-based predictive techniques provide significant performance improvement over simple static and lightweight dynamic techniques for a broad range of applications. However, these history-based predictive techniques struggle to exploit the high reuse of certain applications for which variability arises from fundamental application characteristics. In this chapter, we specifically analyze the suitability of domain-agnostic predictive techniques for applications from the domain of graph analytics. We qualitatively and quantitatively explain why these domain-agnostic techniques are fundamentally deficient for the important domain of graph analytics and motivate the need for software-hardware co-design in managing the LLC for graph analytics. The chapter is organized as follows. First, in Sec. 4.1.1, we discuss two important properties of graph datasets that influence cache efficiency. Next, we explain the basics of the data structures used in graph processing, followed by the cache access patterns of the individual data structures (Secs. 4.2 & 4.3). Finally, we highlight the challenges in improving cache efficiency for graph analytics and discuss the limitations of prior software and hardware techniques in addressing those challenges (Secs. 4.4, 4.5 & 4.6).
Graph analytics is an exciting and rapidly growing field with applications spanning diverse areas such as uncovering latent relationships (e.g., for recommendation systems) and pinpointing influencers in social graphs (e.g., for marketing purposes), among others. Real-world graphs from these areas often have two distinguishing properties, skew in their degree distribution and community structure, that influence cache efficiency while processing graphs.
Dataset | kr | pl | tw | sd | lj | wl | fr | mp
In-Edges: Hot Vertices (%) | 9 | 16 | 12 | 11 | 25 | 12 | 24 | 10
In-Edges: Edge Coverage (%) | 93 | 83 | 84 | 88 | 81 | 88 | 86 | 80
Out-Edges: Hot Vertices (%) | 9 | 13 | 10 | 13 | 26 | 20 | 18 | 12
Out-Edges: Edge Coverage (%) | 93 | 88 | 83 | 88 | 82 | 94 | 92 | 81
Table 4.1: Percentage of vertices classified as hot and the percentage of edges covered by them (edge coverage), computed separately over in-edges and out-edges, for each evaluated dataset.
A distinguishing property of graph datasets common in many graph-analytic applications is that the vertex degrees follow a skewed power-law distribution, in which a small fraction of vertices, hot vertices, have many connections while the majority of vertices, cold vertices, have relatively few connections [6, 28, 61, 105, 106]. Graphs characterized by such a distribution are known as natural or scale-free graphs and are prevalent in a variety of domains, including social networks, computer networks, financial networks, semantic networks, and airline networks. Table 4.1 quantifies the skew for the datasets evaluated in this thesis (Sec. 5.4 of Chapter 5 contains more details of the datasets). For example, in the sd dataset, 11% of the total vertices are classified as hot vertices in terms of their in-degree (13% for out-degree) distribution. These hot vertices are connected to 88% of all in-edges (88% of all out-edges) in the graph. Similarly, in the other datasets, 9%-26% of vertices are classified as hot vertices, and these are connected to 80%-94% of all edges. Real-world graphs often feature clusters of highly interconnected vertices, such as communities of common friends in a social graph [83, 96]. Such community structure is often captured by the vertex ordering within a graph dataset by placing vertices from the same community nearby in the memory space. At runtime, vertices that are placed nearby in memory are typically processed within a short time window of each other. Thus, by placing vertices from the same community nearby in memory, both temporal and spatial locality are improved at the cache block level for such datasets.
Figure 4.1: (a) An example graph. (b) CSR format encoding in-edges. Elements of the same colors across the arrays correspond to the same destination vertex. The number of bars (labeled Reuse) below each element of the Property Array shows the number of times that element is accessed in one full iteration, where the color of a bar indicates the vertex making the access.
The majority of shared-memory graph frameworks are based on a vertex-centric model, in which an application computes some information for each vertex based on the properties of its neighboring vertices [42, 48, 55, 57, 62, 75]. Applications may perform pull- or push-based computations. In pull-based computations, a vertex pulls updates from its in-neighbors. In push-based computations, a vertex pushes updates to its out-neighbors. This process may be iterative, and all or only a subset of vertices may participate in a given iteration.
The Compressed Sparse Row (CSR) format is commonly used to represent graphs in a storage-efficient manner. CSR uses a pair of arrays, Vertex and Edge, to encode the graph. CSR encodes in-edges for pull-based computations and out-edges for push-based computations. In this discussion, we focus on pull-based computations and note that the observations hold for push-based computations. For every vertex, the Vertex Array maintains an index that points to its first in-edge in the Edge Array. The Edge Array stores all in-edges, grouped by destination vertex ID. For each in-edge, the Edge Array entry stores the associated source vertex ID.
Graph applications use an additional Property Array (or arrays) to hold partial or final results for every vertex. For example, the PageRank application maintains two ranks for every vertex: one computed in the previous iteration and one being computed in the current iteration. An implementation may use either two separate arrays (each storing one rank per vertex) or one array (storing two ranks per vertex). Fig. 4.1(a) and 4.1(b) respectively show a simple graph and its CSR representation for pull-based computations, along with one Property Array.
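To make the layout concrete, here is a minimal sketch of one pull-based iteration over the CSR arrays just described; the PageRank-style accumulation, array types and names are illustrative assumptions rather than code from any particular framework. The LLC access pattern this traversal induces is analyzed next.

#include <cstdint>
#include <vector>

// CSR encoding of in-edges, as described above.
struct CsrGraph {
  std::vector<uint64_t> vertex;  // vertex[v] = index of v's first in-edge; size V+1
  std::vector<uint32_t> edge;    // source vertex ID of every in-edge, grouped by destination
};

// One pull-based iteration: every vertex pulls the property of its in-neighbors.
void pull_iteration(const CsrGraph& g,
                    const std::vector<double>& prev,  // Property Array from the last iteration
                    std::vector<double>& next) {      // Property Array being computed
  const uint32_t v_cnt = static_cast<uint32_t>(g.vertex.size()) - 1;
  for (uint32_t dst = 0; dst < v_cnt; ++dst) {
    double sum = 0.0;
    // The Vertex and Edge Arrays are each walked exactly once per iteration
    // (streaming at the LLC), ...
    for (uint64_t e = g.vertex[dst]; e < g.vertex[dst + 1]; ++e) {
      const uint32_t src = g.edge[e];
      // ...whereas a Property Array element is re-read once per out-edge of
      // src, so high out-degree (hot) vertices see most of the temporal reuse.
      sum += prev[src];
    }
    next[dst] = sum;
  }
}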
At the most fundamental level, a graph application computes a property for a vertex based on the properties of its neighbors. To find the neighboring vertices, the application traverses the portion of the Edge Array corresponding to a given vertex, and then accesses the elements of the Property Array corresponding to these neighboring vertices. Fig. 4.1(b) highlights the elements accessed during the computations for vertices ID-1 and ID-3. As the figure shows, each element in the Vertex and Edge Arrays is accessed exactly once during an iteration, exhibiting no temporal locality at the LLC. These arrays may exhibit high spatial locality, which is filtered by the L1-D cache, leading to a streaming access pattern at the LLC. In contrast, the Property Array does exhibit temporal reuse. However, the reuse is not consistent for all elements. Specifically, reuse is proportional to the number of out-edges for pull-based algorithms. Thus, the elements corresponding to high out-degree vertices exhibit high reuse. Fig. 4.1(b) shows the reuse for the high out-degree (i.e., hot) vertices of the Property Array assuming pull-based computations; other elements do not exhibit reuse. The same observation applies to high in-degree vertices in push-based algorithms. Finally, Fig. 4.2 quantifies the LLC behavior of various graph applications (Sec. 5.4 of Chapter 5 contains more details of the applications) on the tw dataset as a representative example of real-world graph datasets. The figure differentiates all LLC accesses and misses as falling either within or outside the Property Array. Unsurprisingly, the Property Array accounts for 78-93% of all LLC accesses. However, despite the high reuse, the Property Array is also responsible for a large fraction of LLC misses, the reasons for which are explained next.
Figure 4.2: Classification of LLC accesses and misses (normalized to total accesses) for five graph applications when processing the tw dataset.

As discussed in the previous section, elements of the Property Array corresponding to the hot vertices exhibit high reuse. Unfortunately, on-chip caches struggle to capitalize on this high reuse for two reasons: lack of spatial locality and difficult-to-exploit temporal locality.
A cache block typically comprises multiple vertices, as the properties associated with a vertex are much smaller than the size of a cache block. Moreover, hot vertices constitute a relatively small fraction of all vertices and are sparsely distributed throughout the memory space of the Property Array. Thus, inevitably, hot vertices share space in a cache block with cold vertices, leading to low spatial locality for hot vertices. Even when a cache block holding a hot vertex is retained in the cache, it leads to underutilization of cache capacity, as a considerable fraction of the cache block is occupied by cold vertices that exhibit low or no reuse. Table 4.2 shows the average number of hot vertices per cache block, assuming typical values of 8 bytes per vertex and 64 bytes per cache block. While, at best, 8 hot vertices can be packed together in a cache block, in practice only 1.3 to 3.5 hot vertices are found per cache block across the datasets. As the footprint (i.e., number of cache blocks) required to store hot vertices is inversely proportional to the average number of hot vertices per cache block, the data shows a significant opportunity for reducing the cache footprint of hot vertices, and in turn, improving cache efficiency.
Dataset | kr | pl | tw | sd | lj | wl | fr | mp
Avg. hot vertices per block | 1.3 | 1.6 | 1.5 | 1.8 | 3.5 | 3.1 | 2.7 | 2.6
Table 4.2: Average number of hot vertices per cache block. The calculation assumes 8 bytes per vertex and 64 bytes per cache block, and counts only cache blocks that contain at least one hot vertex. As a result, any such cache block can contain between 1 and 8 hot vertices.
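The measurement behind Table 4.2 can be summarized with a small sketch, under the stated assumptions of 8-byte vertex properties and 64-byte cache blocks; the hot[] classification (degree at least the average degree of the dataset) follows the definition used earlier, and the function name is purely illustrative.

#include <algorithm>
#include <cstdint>
#include <vector>

// Average number of hot vertices per cache block, counting only blocks that
// contain at least one hot vertex, assuming 8 vertices per 64-byte block.
double avg_hot_vertices_per_block(const std::vector<bool>& hot) {
  constexpr size_t kVerticesPerBlock = 64 / 8;
  uint64_t blocks_with_hot = 0, hot_total = 0;
  for (size_t base = 0; base < hot.size(); base += kVerticesPerBlock) {
    const size_t end = std::min(hot.size(), base + kVerticesPerBlock);
    uint64_t in_block = 0;
    for (size_t v = base; v < end; ++v) in_block += hot[v];
    if (in_block > 0) { ++blocks_with_hot; hot_total += in_block; }
  }
  return blocks_with_hot ? static_cast<double>(hot_total) / blocks_with_hot : 0.0;
}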
Figure 4.3: Reuse distance distribution on a 16-way set-associative 16MB LLC for five graph applications, each processing the tw dataset. The vertical dotted line at a reuse distance of 16 shows the hit-rate under LRU management. The remaining percentage of LLC accesses beyond a reuse distance of 8192 correspond to cold misses and thus have infinite reuse distances.

The access pattern to the Property Array is highly irregular, being heavily dependent on both the graph structure and the application. Between a pair of accesses to a given hot vertex in the Property Array, a number of other, low-/no-reuse data elements (e.g., cold vertices or elements of the Vertex and Edge Arrays) may be accessed, increasing the reuse distance of the accesses to the hot vertices. Any block allocated by these low-/no-reuse data elements will trigger evictions at the LLC, potentially displacing cache blocks holding hot vertices. Fig. 4.3 shows the cumulative reuse distance distribution of LLC accesses for five graph applications processing the tw dataset. For a stream of accesses to a given cache set, the reuse distance of a cache block access is calculated as the number of unique LLC accesses to the set since the previous LLC access to the same cache block. Thus, any LLC access with a reuse distance less than or equal to the associativity of the cache (16 in this study) would result in a cache hit under LRU. As the figure shows, at most 38% of LLC accesses have a reuse distance less than or equal to 16 (shown using a vertical dotted line). Meanwhile, 19%-54% of all LLC accesses have a reuse distance greater than 64 (i.e., 4x the associativity). Long reuse distances, along with irregular access patterns, lead to severe cache thrashing at the LLC, making it difficult for domain-agnostic techniques to capitalize on the high reuse inherent in accesses to hot vertices. We next discuss the most relevant state-of-the-art techniques in both software and hardware that attempt to address the above-mentioned challenges for graph analytics.

The order of vertices in memory is under the control of the graph application. Thus, the application can reorder vertices in memory before processing a graph to improve cache locality. To accomplish this, researchers have proposed various reordering techniques [6, 22, 28, 30, 41, 64, 66, 109, 111, 113]. Reordering techniques only relabel vertices (and edges), which does not alter the graph itself and does not require any changes to the graph algorithms. Following the relabeling, vertices (and edges) are laid out in memory based on the new vertex IDs. The most powerful reordering techniques, like Gorder [41], leverage the community structure typically found in real-world graphs to improve spatio-temporal locality. Gorder comprehensively analyzes vertex connectivity and reorders vertices such that vertices that share common neighbors, and thus are likely to belong to the same community, are placed nearby in memory.
While Gorder is effective at reducing cache misses, it requires a staggering reordering time that is often multiple orders of magnitude higher than the total application runtime, rendering Gorder impractical [6]. To keep the reordering cost affordable, we argue for limiting the scope of vertex reordering to inducing spatial locality only, while leaving the task of exploiting temporal locality to a hardware cache management technique. We collectively refer to such techniques as skew-aware reordering techniques. Unlike Gorder, skew-aware reordering techniques require only lightweight analysis, as they reorder vertices solely based on vertex degrees, and thus can speed up applications even after accounting for the reordering time [6, 28]. Existing skew-aware reordering techniques seek to induce spatial locality among hot vertices by segregating them into a contiguous memory region. As a result, the cache footprint of hot vertices is reduced, which, in turn, improves cache efficiency. However, as a side-effect of reordering, these techniques may destroy a graph's community structure, which could negate the performance gains achieved from the reduced footprint of hot vertices. Thus, there exists a tension between reducing the cache footprint of hot vertices and preserving graph structure when reordering vertices, which must be addressed by a skew-aware technique in order to maximize cache efficiency.
Figure 4.4: Percentage of misses eliminated by RRIP and OPT over LRU on a 16MB LLC. The trace for each application-dataset pair consists of up to 2 billion LLC accesses.
In the previous section, we argued for exploiting the temporal locality of hot vertices through a hardware cache management technique in order to keep software reordering lightweight. In this section, we discuss how effective existing hardware cache management techniques are in exploiting temporal locality, specifically in the context of graph analytics.

Lightweight techniques (i.e., the static and lightweight dynamic techniques from Sec. 2.4) use simple heuristics to manage the LLC. RRIP [71] is the state-of-the-art technique in this category; it relies on a probabilistic approach to classify a cache block as low or high reuse at the time of inserting a new block in the cache. As these techniques do not exploit the reuse behavior of cache blocks from their past generations, they are limited in accurately identifying high-reuse blocks. We quantify the effectiveness of RRIP over LRU using a trace-based study on a set of graph applications processing various high-skew datasets, as shown in Fig. 4.4. The figure plots the percentage of misses eliminated by RRIP over LRU, along with the misses eliminated by OPT [114] to show the maximum opportunity for any cache management technique. RRIP consistently reduces misses over LRU across datapoints, with an average miss reduction of 10.5%. Meanwhile, OPT shows that, on average, 32.3% of misses can be eliminated over LRU, indicating a significant opportunity for improving cache efficiency beyond RRIP.

History-based predictive techniques, such as the state-of-the-art Hawkeye [37] and many others [8, 18, 39, 40, 67, 73, 81], learn the past reuse behavior of cache blocks by employing sophisticated, storage-intensive prediction mechanisms. A large body of recent work focuses on history-based predictive techniques as these generally provide higher performance than the lightweight techniques for a wide range of applications, as shown in Sec. 3.5 of Chapter 3. Meanwhile, for graph analytics, we find that graph-dependent irregular access patterns, combined with long reuse distances, prevent these predictive techniques from correctly learning which cache blocks to preserve. For example, as explained in Sec. 2.4.3 of Chapter 2, most history-based predictive techniques rely on a PC-based correlation to learn which set of PC addresses access high-reuse cache blocks, in order to prioritize these blocks for caching over others. However, we observe that the reuse for elements of the Property Array, which are the prime target for LLC caching in graph analytics (Sec. 4.3), does not correlate with the PC because the same PC accesses hot and cold vertices alike. We quantify the performance of three state-of-the-art history-based predictive techniques: SHiP-MEM, Hawkeye and Leeway. (For Hawkeye, we use the improved, prefetch-aware version from CRC2, i.e., Hawkeye++ from Sec. 3.6 of Chapter 3.) Hawkeye and Leeway rely on a PC-based reuse correlation whereas SHiP-MEM, a variant of SHiP, exploits a region-based correlation. Fig. 4.5 plots application speed-up for these techniques over RRIP for five graph applications, each processing five graph datasets. We use RRIP as a new, stronger baseline as RRIP consistently reduces more misses than LRU, as shown in Fig. 4.4. The results show that all predictive techniques on average cause a slowdown over the RRIP baseline. Irregular access patterns, combined with long reuse distance accesses, impede the learning of these predictive techniques, rendering them deficient for the whole domain of graph analytics. As expected, Leeway tolerates the variability in reuse behavior the best, causing an average slowdown of only 0.8% vs 5.7% for SHiP-MEM and 14.8% for Hawkeye. Alas, Leeway causes a slowdown nonetheless.
The results highlight that existing domain-agnostic cache management techniques are unable to exploit temporal locality despite a significant opportunity.

Software-assisted techniques use compiler analysis, runtime profiling or the domain knowledge of programmers to identify high-reuse cache blocks. The majority of these techniques target regular access patterns, making them infeasible for graph applications, which are dominated by irregular access patterns. Techniques such as XMem [10] dedicate partial or full cache capacity to pinning high-reuse blocks in the cache.
Figure 4.5: Performance evaluation of state-of-the-art domain-agnostic cache management techniques over RRIP.

Hardware ensures that the pinned blocks cannot be evicted by other cache blocks, and thus they are protected from cache thrashing. Such an approach is only feasible when the high-reuse working set fits in the available cache capacity. Unfortunately, for large graph datasets, even with high skew, it is unlikely that all hot vertices will fit in the LLC; recall from Table 4.1 that hot vertices account for up to 26% of the total vertices. Moreover, some of the colder vertices might also exhibit short-term temporal reuse, particularly in graphs with community structure. These observations call for a new LLC management technique that employs (1) a reliable mechanism to identify hot vertices amidst irregular access patterns and (2) flexible cache policies that maximize reuse among hot vertices by protecting them in the cache, without denying colder vertices the ability to be cache resident if they exhibit reuse.
Graph analytics on natural graphs exhibits poor cache efficiency due to low spatial locality and difficult-to-exploit temporal locality. Existing domain-agnostic hardware cache management techniques are limited in addressing both of these challenges. First, hardware alone cannot enforce spatial locality, which is dictated by vertex placement in the memory space and is under software control. Second, domain-agnostic hardware cache management techniques struggle to pinpoint hot vertices under cache thrashing due to the long reuse distance accesses and irregular access patterns endemic to graph analytics. Both of these challenges can be addressed by leveraging lightweight software support. First, a skew-aware lightweight software technique can induce spatial locality by segregating hot vertices into a contiguous memory region. Second, software has knowledge of the memory locations of hot vertices. Utilizing software knowledge can enable a reliable mechanism for hardware to identify hot vertices amidst irregular access patterns. Based on these observations, we propose a holistic software-hardware co-design to improve cache efficiency for graph analytics. Our software component is responsible for inducing spatial locality for hot vertices. The software component also facilitates our hardware's task of pinpointing the cache blocks containing hot vertices. While the software informs the hardware, the hardware is ultimately in control of deciding which vertices to evict and which to preserve based on the available cache capacity and temporal access patterns, thus relieving the software of any additional runtime overhead. The end result is software that incurs minimal runtime overhead, and simple hardware that reliably identifies cache blocks that are likely to exhibit high reuse. In the following chapters, we discuss each of these components in detail. In Chapter 5, we present DBG, a new skew-aware vertex reordering technique. In Chapter 6, we introduce GRASP, domain-specialized cache management for graph analytics.

Chapter 5: DBG – Lightweight Vertex Reordering

For a typical graph application, a cache block contains multiple vertices, as vertex properties usually require just 4 to 16 bytes whereas the cache block size in modern processors is typically 64 or 128 bytes. Since hot vertices are sparsely distributed in memory, and are smaller in number, they inevitably share cache blocks with cold vertices, leading to underutilization of a considerable fraction of useful cache capacity. Skew-aware techniques reorder vertices in memory such that hot vertices are adjacent to each other in a contiguous memory region. As a result, each cache block is comprised exclusively of hot or cold vertices, reducing the total footprint (i.e., number of cache blocks) required to store hot vertices. Blocks that are exclusively comprised of hot vertices are far more likely to be retained in the cache due to higher aggregate hit rates, leading to higher utilization of the existing cache capacity. A straightforward way to pack vertices with similar degree into each cache block is to apply
Sort Reordering, which sorts vertices based on their degree. However, Sort is not always beneficial, because many real-world graph datasets exhibit a strong structure, e.g., clusters of webpages within the same domain in a web graph, or communities of common friends in a social graph [83, 96]. In such datasets, vertices within the same community are accessed together, and often reside nearby in memory, exhibiting spatio-temporal locality that should be preserved. Fine-grain vertex reordering, such as Sort and Hub Sorting [28], destroys this spatio-temporal locality, which limits the effectiveness of such reordering on datasets that exhibit structure.
In this chapter, we quantify the potential performance loss due to disruption of graph structure on various datasets. We further characterize locality at all three levels of the cache hierarchy, and show that all skew-aware techniques are generally effective at reducing LLC misses. However, techniques employing fine-grain reordering significantly disrupt graph structure, increasing misses in the higher-level caches. Our results highlight a tension between reducing the cache footprint of hot vertices and preserving graph structure, limiting the effectiveness of prior skew-aware techniques. To overcome the limitations of prior techniques, we propose
Degree-Based Grouping (DBG), a novel reordering technique that largely preserves graph structure while reducing the cache footprint of hot vertices. Like prior skew-aware techniques, DBG segregates hot vertices from the cold ones. However, to preserve the existing graph structure, DBG employs coarse-grain reordering. DBG partitions vertices into a small number of groups based on their degree but maintains the original relative order of vertices within each group. As DBG does not sort vertices within any group (in order to minimize structure disruption), it also incurs a very low reordering overhead. To summarize, we make the following contributions:
• We study existing skew-aware reordering techniques on a variety of multi-threaded graph applications processing varied datasets. Our characterization reveals the inherent tension between reducing the cache footprint of hot vertices and preserving graph structure.
• We propose DBG, a new skew-aware reordering technique that employs lightweight coarse-grain reordering to largely preserve existing graph structure while reducing the cache footprint of hot vertices.
• Our evaluation on a real machine shows that DBG outperforms existing skew-aware techniques. Averaging across 40 datapoints, DBG yields a speed-up of 16.8%, vs 11.6% for the best-performing existing skew-aware technique, over the baseline with no reordering.
In order to provide high performance for graph applications, skew-aware reordering techniques should achieve all of the following three objectives:

O1. Low Reordering Time:
Reordering time plays a crucial role in deciding whether a technique is viable in providing end-to-end application performance gains after accounting for the reordering time. A lower reordering time facilitates amortizing the reordering overhead over fewer graph traversals.
O2. High Cache Efficiency:
As explained in Sec. 4.4.1 of Chapter 4, a cache block is comprised of multiple vertices. Problematically, hot vertices are sparsely distributed throughout the memory space, which leads to cache blocks containing vertices with vastly different degrees. To address this, vertex reordering should ensure that hot vertices are placed adjacent to each other in the memory space, thus reducing the cache footprint of hot vertices, and in turn, improving cache efficiency.
O3. Structure Preservation:
As explained in Sec. 4.1.2 of Chapter 4, many real-world graph datasets have a vertex ordering that results in high spatio-temporal cache locality. For such datasets, vertex reordering should ensure that the original structure is preserved as much as possible. If the structure is not preserved, reordering may adversely affect locality, negating the performance gains achieved from the reduced footprint of hot vertices.
In this section, we characterize how important it is to preserve graph structure for different datasets. To quantify the potential performance loss due to the reduction in spatio-temporal locality arising from reordering, we randomly reorder vertices, which decimates any existing structure. Randomly reordering all vertices would cause a slowdown for two potential reasons: (1) by destroying graph structure, thus reducing spatio-temporal locality; (2) by further scattering hot vertices in memory, thus increasing the cache footprint of hot vertices. To isolate the performance loss due to the former, we also evaluate random reordering at a cache block granularity. In such a reordering, cache blocks (not individual vertices) are randomly reordered in memory, which means that the vertices within a cache block are moved as a group. As a result, the cache footprint of hot vertices is unaffected, and any change in performance can be directly attributed to a change in graph structure. Fig. 5.2(a) illustrates vertex placement in memory after Random Reordering at a vertex and at a cache block granularity. Fig. 5.1 shows the performance slowdown due to Random Reordering for the Radii application on all datasets listed in Table 5.7. The figure shows four configurations: Random Vertex (RV), which reorders at the granularity of one vertex, and Random Cache
Figure 5.1: Application slowdown after random reordering at different granularities for the Radii application. The lower the bar, the better the application performance.
Block Block-n (RCB-n), which reorders at the granularity of n cache blocks, where n is 1, 2 or 4. The performance difference between RV and RCB-1 is very large for the four right-most datasets. Recall from Table 4.2 of Chapter 4 that these datasets have a relatively high number of hot vertices per cache block. RV scatters the hot vertices in memory, incurring large slowdowns for these datasets. The performance slowdown for RCB-1 is significant on all real-world datasets (i.e., all but kr), and ranges from 9.6% to 28.5%. This slowdown can be attributed to the disruption of spatio-temporal locality for the real-world datasets, confirming the existence of community structure in the original ordering of these datasets. As the reordering granularity increases, the disruption in graph structure reduces, which also reduces the slowdown. For example, on the mp dataset, the dataset most affected by Random Reordering, the performance slowdown is 28.5% for RCB-1, which reduces to 21.6% for RCB-2 and 15.6% for RCB-4. Results for kr, the only synthetic dataset in the mix, are in stark contrast with those of the real-world datasets. As kr is generated synthetically, kr does not have any structure in its original ordering. Thus, performance on the kr dataset is largely oblivious to random reordering at any granularity. The results show that real-world graph datasets exhibit some structure in their original ordering, which, if not preserved, is likely to adversely affect performance. The results also indicate that structure can be largely preserved by applying reordering at a coarse granularity.

This section describes the existing skew-aware techniques and how they fare in achieving the three objectives listed in Sec. 5.2.1. As skew-aware techniques solely rely on vertex degrees for reordering, they all incur relatively low reordering time, achieving objective O1. However, for the two remaining objectives, reducing the cache footprint of hot vertices and preserving existing graph structure, existing techniques trade one for the other, hence failing to achieve at least one of the two objectives.
Figure 5.2: Vertex ordering in memory for different techniques. The vertex degree is shown inside each box while the original vertex ID is shown below the box. Hot vertices (degree ≥ 20) are shown in color. The hottest among the hot vertices (degree ≥ 40) are shown in a darker shade. Finally, Random (Cache Block Granularity) assumes two vertices per cache block.
Sort reorders vertices in descending order of their degree. Sort requires the least possible number of cache blocks to store hot vertices, without explicitly classifying individual vertices as hot or cold. However, Sort reorders all vertices, which completely destroys the original graph structure. Fig. 5.2(b) shows vertex placement in memory after Sort Reordering.
Hub Sorting [28] (also known as Frequency-based Clustering) was proposed as a variant of Sort that aims to preserve some structure while reducing the cache footprint of hot vertices. Hub Sorting uses the average degree of the dataset as a threshold to classify vertices as hot or cold, and sorts only the hot vertices. Hub Sorting does preserve partial structure by not sorting the cold vertices, but, problematically, the hot vertices are fully sorted. While hot vertices constitute a smaller fraction compared to the cold ones, recall from Table 4.1 of Chapter 4 that hot vertices account for up to 26% of the total vertices. Moreover, hot vertices are connected to the vast majority of edges (80%-94%), and thus are responsible for the majority of reuse. Consequently, preserving structure for hot vertices is also important, at which Hub Sorting fails.
Hub Clustering [6] is a variant of Hub Sorting that only segregates hot vertices from the cold ones but does not sort them. While Hub Clustering was proposed as an alternative to Hub Sorting with a lower reordering time, we note that Hub Clustering is also better than Hub Sorting at preserving graph structure, as Hub Clustering does not sort any vertices. However, by not sorting hot vertices, Hub Clustering sacrifices a significant opportunity to improve cache efficiency, as discussed next.
Per-Vertex Property kr pl tw sd lj wl fr mp
Table 5.1: Cache size (MB) needed to store all hot vertices, assuming 8 and 16 bytes per property, respectively. A vertex is classified as hot if its degree is equal to or greater than the average degree of the dataset.

For large graph datasets, it is unlikely that all hot vertices fit in the LLC. For example, the sd dataset requires at least 80MB to store all hot vertices assuming only 8 bytes per vertex (refer to Table 5.1 for the requirements of the remaining datasets). The required capacity significantly exceeds the typical LLC size of commodity server processors. As a result, all hot vertices compete for the limited LLC capacity, causing cache thrashing. Fortunately, not all hot vertices have similar reuse, as vertex degree varies vastly among hot vertices. Table 5.2 shows the degree distribution for just the hot vertices of the sd dataset. Each column in the table represents a degree range as a function of 𝔸, the average degree of the dataset. For instance, the first column covers vertices whose degree ranges from 𝔸 to 2𝔸; these are the lowest-degree vertices among the hot ones (recall that a vertex is classified as hot if its degree is equal to or greater than 𝔸). For a given range, the table shows the number of vertices (as a percentage of total hot vertices) whose degree is within that range. The table also shows the cache capacity needed for that many vertices, assuming 8 bytes per vertex property. Unsurprisingly, given the power-law degree distribution, the table shows that the least-hot vertices are the most numerous, representing 45% of all hot vertices and requiring 35.8MB of capacity, yet likely exhibiting the least reuse among hot vertices. In contrast, vertices with degree above 8𝔸 (the three right-most columns) are the hottest of all, constituting just 12% of total hot vertices (< 10MB footprint). Naturally, these hottest vertices are the ones that should be retained in the cache. However, by not sorting hot vertices, Hub Clustering fails to differentiate between the most- and the least-hot vertices, hence denying the hottest vertices an opportunity to stay in the cache in the presence of cache thrashing.

Degree Range | [1𝔸, 2𝔸) | [2𝔸, 4𝔸) | [4𝔸, 8𝔸) | [8𝔸, 16𝔸) | [16𝔸, 32𝔸) | [32𝔸, ∞)
Vertices (%) | 45% | 28% | 15% | 7% | 3% | 2%
Footprint (MB) | 35.8 | 22.3 | 12.0 | 5.7 | 2.2 | 1.8
Table 5.2: Degree distribution of hot vertices for the sd dataset, whose AverageDegree ( 𝔸 ) is 20. Row denying the hottest vertices an opportunity to stay in the cache in the presence ofcache thrashing.To summarize, Sort achieves the maximum reduction in the cache footprint of hotvertices. However, in doing so, Sort completely decimates existing graph structure.Hub Sorting and Hub Clustering both classify vertices as hot or cold based on theirdegree and preserve the structure for cold vertices. However, in dealing with hotvertices, they resort to inefficient extremes. At one extreme, Hub Sorting employsfine-grain reordering that sorts all hot vertices, destroying existing graph structure.At the other extreme, Hub Clustering does not apply any kind of reordering amonghot vertices, sacrificing significant opportunity in improving cache efficiency. To address the limitations of prior skew-aware reordering techniques, we propose
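To make the bucketing behind Table 5.2 concrete, the sketch below computes a Table 5.2-style distribution from a degree array. It is a minimal illustration of ours, not code from the thesis; the 8-bytes-per-property assumption matches the table, and the function name hot_vertex_histogram is our own.

    #include <cstdint>
    #include <cstdio>
    #include <vector>

    // Bucket hot vertices (degree >= avg_degree) into geometric degree ranges
    // [A,2A), [2A,4A), ... and report the vertex count and cache footprint per
    // range, assuming 8 bytes of property state per vertex (as in Table 5.2).
    void hot_vertex_histogram(const std::vector<uint64_t>& degree, double avg_degree) {
        std::vector<uint64_t> count;                 // one counter per degree range
        for (uint64_t d : degree) {
            if (d < avg_degree) continue;            // cold vertex: not part of Table 5.2
            std::size_t bucket = 0;
            double hi = 2 * avg_degree;              // upper bound of the first range [A, 2A)
            while (d >= hi) { hi *= 2; ++bucket; }
            if (bucket >= count.size()) count.resize(bucket + 1, 0);
            ++count[bucket];
        }
        uint64_t hot_total = 0;
        for (uint64_t c : count) hot_total += c;
        if (hot_total == 0) return;                  // no hot vertices
        for (std::size_t b = 0; b < count.size(); ++b) {
            double pct = 100.0 * count[b] / hot_total;
            double mb  = count[b] * 8.0 / (1 << 20); // 8 bytes per vertex property
            std::printf("[%g*A, %g*A): %5.1f%% of hot vertices, %8.1f MB\n",
                        double(1ull << b), double(2ull << b), pct, mb);
        }
    }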
To address the limitations of prior skew-aware reordering techniques, we propose Degree-Based Grouping (DBG), a novel skew-aware technique that applies coarse-grain reordering such that each cache block is comprised of vertices with similar degree, and in turn, similar hotness, while also preserving graph structure at large.

Unlike Hub Sorting and Hub Clustering, which rely on a single threshold to classify vertices as hot or cold, DBG employs a simple binning algorithm to coarsely partition vertices into different groups (or bins) based on their hotness level. Groups are assigned exclusive degree ranges such that the degree of any vertex falls within the degree range of exactly one group. Within each group, DBG maintains the original relative order of vertices to preserve graph structure at large. To keep the reordering time low, DBG maintains only a small number of groups and does not sort vertices within any group. Listing 5.1 presents the formal DBG algorithm.

    Input:  Graph G(V, E), where graph G has V vertices and E edges.
    Input:  Degree distribution D[], where D[v] is the degree of vertex v.
    Output: Mapping M[], where M[v] is the new ID of vertex v.

    DBG: Binning algorithm to reorder vertices into K groups (K > 1).
    1: Assign a degree range [P_k, Q_k) to every Group_k such that
       Q_1 > max(D[]) & P_K <= min(D[]) & Q_(k+1) = P_k < Q_k, for every k < K.
    2: For every vertex v from 1 to V:
         Append v to the Group_k for which D[v] is in [P_k, Q_k).
    3: Assign new IDs to all vertices as follows:
         id := 1
         For every Group_k from 1 to K:
           For every vertex v in Group_k:
             M[v] := id++, where v is the original ID.

Listing 5.1: DBG algorithm. Degree can be in-degree, out-degree or the sum of both.

To assign degree ranges to different groups, DBG leverages the power-law distribution of vertex connectivity in natural graphs. For example, recall Table 5.2, which shows the distribution of hot vertices across different degree ranges. Vertices in the smallest degree range constitute the largest fraction of hot vertices. As the degree range doubles, the number of vertices is roughly halved, exhibiting the power-law distribution. Thus, geometrically-spaced degree ranges provide a natural way to segregate vertices with different levels of hotness. At the same time, using such wide ranges to partition vertices facilitates reordering at a very coarse granularity, preserving structure at large. Meanwhile, by not sorting vertices within any group, DBG incurs a very low reordering time. Thus, DBG successfully achieves all three objectives listed in Sec. 5.2.1. Fig. 5.3 shows vertex placement in memory after DBG reordering for a synthetic example.

Figure 5.3: Vertex ordering in memory after DBG. In this example, DBG partitions vertices into three groups with degree ranges [0, 20), [20, 40) and [40, 80). DBG maintains the relative order of vertices within a group. As a result, many vertices are placed near the same vertices as before the reordering.
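As a concrete illustration of Listing 5.1, the following sketch implements the DBG binning pass. It is our own minimal, sequential sketch rather than the released implementation; the function name dbg_reorder and the representation of group boundaries are assumptions, with the group boundaries supplied by the caller (e.g., geometrically spaced around the average degree, as described above).

    #include <cstddef>
    #include <cstdint>
    #include <vector>

    // Sequential sketch of the DBG binning pass (Listing 5.1).
    // 'degree[v]' is the degree of vertex v; 'boundaries' holds the lower bound
    // of each group's degree range in decreasing order of hotness, with the last
    // entry equal to 0 so that every vertex falls into some group.
    // Returns M, where M[v] is the new ID of original vertex v (0-based here).
    std::vector<uint32_t> dbg_reorder(const std::vector<uint32_t>& degree,
                                      const std::vector<uint32_t>& boundaries) {
        const std::size_t num_groups = boundaries.size();
        std::vector<std::vector<uint32_t>> groups(num_groups);

        // Step 2: append each vertex to the group whose degree range contains it.
        // Scanning vertices in their original order preserves the relative order
        // of vertices within every group (no sorting).
        for (uint32_t v = 0; v < degree.size(); ++v) {
            std::size_t g = num_groups - 1;               // coldest group by default
            for (std::size_t k = 0; k < num_groups; ++k) {
                if (degree[v] >= boundaries[k]) { g = k; break; }
            }
            groups[g].push_back(v);
        }

        // Step 3: assign new IDs by walking the groups from hottest to coldest.
        std::vector<uint32_t> M(degree.size());
        uint32_t id = 0;
        for (const auto& group : groups)
            for (uint32_t v : group)
                M[v] = id++;
        return M;
    }

A parallel version can bin vertices into per-thread buckets and concatenate them, which is one way the reordering time can be kept low.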
Technique        No. of Groups            Degree Ranges
Sort             𝕄 + 1                    [n, n+1) where n ∈ [0, 𝕄]
Hub Sorting      𝕄 - 𝔸 + 2                [0, 𝔸), [n, n+1) where n ∈ [𝔸, 𝕄]
Hub Clustering   2                        [0, 𝔸), [𝔸, 𝕄]
DBG              ⌊log_2(𝕄/ℂ)⌋ + 2         [0, ℂ), [2^n ℂ, 2^(n+1) ℂ) where n ∈ [0, ⌊log_2(𝕄/ℂ)⌋]

Table 5.3: Implementation of various skew-aware techniques using the DBG algorithm. 𝔸 is the average and 𝕄 is the maximum degree of the dataset. For DBG, ℂ is some threshold such that 0 < ℂ < 𝕄.

Finally, we note that the DBG algorithm (Listing 5.1) provides a general framework to reason about the trade-off between reducing the cache footprint of hot vertices and preserving graph structure, simply by varying the number of groups and their degree ranges. Indeed, Table 5.3 shows how different skew-aware techniques can be implemented using the DBG algorithm. For example, Hub Clustering can be viewed as an implementation of the DBG algorithm with two groups, one containing hot vertices and the other containing cold vertices. Similarly, Sort can be seen as an implementation of the DBG algorithm with as many groups as there are unique degree values in a given dataset. Consequently, for a given unique degree, the associated group contains all vertices having that degree, effectively sorting vertices by their degree. In general, as the number of groups increases, the degree ranges get narrower and the vertex reordering gets finer, causing more disruption to the existing structure. Table 5.4 qualitatively compares DBG to prior techniques.
Technique            Structure Preservation   Reordering Time   Net Performance
Sort                 ✗                        ✓                 ✓
Hub Sorting [28]     ✓                        ✓                 ✓
Hub Clustering [6]   ✓✓                       ✓✓                ✓
DBG (proposed)       ✓✓                       ✓✓                ✓✓
Gorder [41]          ✓✓                       ✗                 ✗

Table 5.4: Qualitative performance of different reordering techniques for graph analytics on natural graphs.
Application                         Brief Description
Betweenness Centrality (BC)         Finds the most central vertices in a graph by using a BFS kernel to count the number of shortest paths passing through each vertex from a given root vertex.
Single Source Shortest Path (SSSP)  Computes the shortest distance to vertices in a weighted graph from a given root vertex using the Bellman-Ford algorithm.
PageRank (PR)                       An iterative algorithm that calculates ranks of vertices based on the number and quality of their incoming edges [108].
PageRank-Delta (PRD)                A faster variant of PageRank in which vertices are active in an iteration only if they have accumulated enough change in their PageRank score.
Radii Estimation (Radii)            Estimates the radius of each vertex by performing multiple parallel BFS's from a small sample of vertices [77].
Table 5.5: A list of evaluated graph applications.
                              Per-Vertex Property Size (Bytes)
Graph        Computation      All          Only Properties with     Degree Type used
Application  Type             Properties   Irregular Accesses       for Reordering
BC           pull-push        17           8                        out
SSSP         push-only        8            8                        in
PR           pull-only        20           12                       out
PRD          push-only        20           8                        in
Radii        pull-push        20           8                        out
Table 5.6: Properties of graph applications. In addition to the vertex properties, all graph applications require 4 bytes to encode a vertex and 8 bytes to encode an edge.
For the evaluation, we use
Ligra [57], a widely used shared-memory graph processing framework that supports both pull- and push-based computations, including switching from pull to push (and vice versa) at the start of a new iteration. We evaluate various reordering techniques using the five iterative graph applications listed in Table 5.5, on the eight graph datasets listed in Table 5.7, resulting in 40 datapoints for each technique. Table 5.6 lists various properties of the Ligra implementation of the evaluated graph applications.

Dataset                 Vertex Count   Edge Count   Avg. Degree   Type        Original Ordering
Kron (kr) [42]          67M            1,323M       20            Synthetic   Unstructured
PLD (pl) [7]            43M            623M         15            Real        Unstructured
Twitter (tw) [74]       62M            1,468M       24            Real        Unstructured
SD (sd) [7]             95M            1,937M       20            Real        Unstructured
LiveJournal (lj) [51]   5M             68M          14            Real        Structured
WikiLinks (wl) [26]     18M            172M         -             Real        Structured
Friendster (fr) [33]    64M            2,147M       33            Real        Structured
MPI (mp) [23]           53M            1,963M       37            Real        Structured

Table 5.7: Properties of the evaluated graph datasets. We empirically label as structured those datasets on which Random Reordering (RV) causes more than a 25% slowdown (Fig. 5.1).
Dataset              Vertex Count   Edge Count   Avg. Degree   Type
Uniform (uni) [45]   50M            1,000M       20            Synthetic
Road (road) [47]     24M            29M          -             -

Table 5.8: Properties of the no-skew graph datasets. The uni dataset is generated using the R-MAT [92] methodology with parameter values of A=B=C=25.

We obtained the source code for the graph applications from Ligra [57]. The implementation of the graph applications is unchanged except for the addition of an array to keep a mapping between the vertex ID assignments before and after the reordering. The mapping is needed to ensure that root-dependent traversal applications running on the reordered graph datasets use the same root as the baseline execution running on the original graph dataset. We compile the applications using g++-6.4 with the O3 optimization level on Ubuntu 14.04.1 booted with Linux kernel 4.4.0-96-lowlatency and use OpenMP for parallelization. To utilize memory bandwidth from both sockets, we run every application under the NUMA interleave memory allocation policy.
Evaluation is done on a dual-socket server with two Broadwell based
Intel Xeon CPU E5-2630 [36], each with 10 cores clocked at 2.2GHz and a 25MB shared LLC. Hyper-threading is kept on, exposing 40 hardware execution contexts across both CPUs. The server has 128GB of DRAM provided by eight DIMMs clocked at 2133MHz. Applications use 40 threads, and the threads are pinned to avoid performance variations due to OS scheduling. To further reduce sources of performance variation, DVFS features are disabled. Finally,
Transparent Huge Pages is kept on to reduce TLB misses. We evaluate each reordering technique on every combination of graph applications and graph datasets 11 times, and record the average runtime of 10 executions, excluding the timing of the first execution to allow the caches to warm up. We report the speed-up over the entire application runtime (with and without reordering cost) but exclude the graph loading time from disk. For the iterative applications, PR and PRD, we run them until convergence and consider the aggregate runtime over all iterations. For the root-dependent traversal applications, SSSP and BC, we run them from eight different root vertices for each input dataset and consider the aggregate runtime over all eight traversals. Finally, we note that the application runtime is relatively stable across executions. For each reported datapoint, the coefficient of variation is at most 2.3% for PRD and at most 1.6% for the other applications.
We evaluate DBG and compare it with all three existing skew-aware techniques described in Sec. 5.2.3 (Sort, HubSort [28] and HubCluster [6]) along with Gorder [41], the state-of-the-art structure-aware reordering technique. We use the source code available from https://github.com/datourat/Gorder for Gorder. As Gorder is only available as a single-threaded implementation, when reporting the reordering time of Gorder for a given dataset, we optimistically divide the reordering time by 40 (the maximum number of threads supported on the server) to provide a fair comparison with the skew-aware techniques, whose reordering implementations are fully parallelized.

For DBG, we use 8 groups with the ranges [32𝔸, ∞), [16𝔸, 32𝔸), [8𝔸, 16𝔸), [4𝔸, 8𝔸), [2𝔸, 4𝔸), [1𝔸, 2𝔸), [𝔸/2, 𝔸) and [0, 𝔸/2), where 𝔸 is the average degree of the graph dataset. Note that we also partition cold vertices into two groups. We developed a multi-threaded implementation of DBG, which is available at https://github.com/faldupriyank/dbg.
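As a usage example of the binning sketch shown after Listing 5.1 (our own sketch, not the released implementation), the 8-group configuration above corresponds to the following group boundaries derived from the average degree 𝔸:

    #include <cstdint>
    #include <vector>

    // Lower bounds of the 8 DBG groups used in the evaluation, hottest first:
    // [32A,inf), [16A,32A), [8A,16A), [4A,8A), [2A,4A), [A,2A), [A/2,A), [0,A/2).
    std::vector<uint32_t> dbg_boundaries(double avg_degree) {
        std::vector<uint32_t> bounds;
        for (double m : {32.0, 16.0, 8.0, 4.0, 2.0, 1.0, 0.5})
            bounds.push_back(static_cast<uint32_t>(m * avg_degree));
        bounds.push_back(0);                  // coldest group: [0, A/2)
        return bounds;                        // e.g., pass to dbg_reorder(degree, bounds)
    }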
Figure 5.4: Application speed-up over the baseline with no reordering. Techniques with suffix O use their original implementations, whereas techniques without any suffix are implemented using the DBG algorithm as per Table 5.3. Each bar shows the geometric mean of speed-ups across the five applications for a dataset.
Technique      kr     pl     tw     sd     lj     wl     fr     mp
HubSort-O      1.02   1.04   1.01   1.02   1.09   0.79   1.04   1.01
HubSort        0.80   0.82   0.84   0.84   0.87   0.91   0.90   0.89
HubCluster-O   0.78   0.79   0.81   0.81   0.78   0.56   0.88   0.87
HubCluster     0.77   0.74   0.81   0.78   0.76   0.81   0.84   0.82
Table 5.9: Reordering time for existing skew-aware techniques, normalized to that of Sort. Lower is better.

Finally, we implement HubSort and HubCluster using the DBG algorithm as shown in Table 5.3. We found our implementations to be more effective than the original implementations (referred to as HubSort-O and HubCluster-O) provided by the authors of HubCluster. Fig. 5.4 shows application speed-up over the baseline with no reordering. Table 5.9 shows reordering time normalized to that of Sort. As our implementations of both techniques provide better speed-ups and lower reordering times, we use our implementations in the main evaluation.
In this section, we evaluate the effectiveness of DBG against the state-of-the-art reordering techniques. In Sec. 5.5.1, we compare the application speed-up for these techniques without considering the reordering time. In Sec. 5.5.2 and Sec. 5.5.3, we analyze different levels of the cache hierarchy to understand the sources of performance variation. Subsequently, to understand the effect of the reordering time on end-to-end performance, we compare the application speed-up after accounting for the reordering time in Sec. 5.5.4.
Figure 5.5: Application speed-up (excluding reordering time) for reordering techniques over the baseline with no reordering. (a) Unstructured datasets. (b) Structured datasets.
Fig. 5.5 shows application speed-up, excluding reordering time, for various datasets. Averaging across all 40 datapoints (combining all structured and unstructured datasets), DBG provides a 16.8% speed-up over the baseline with no reordering, outperforming all existing skew-aware techniques: Sort (8.4%), HubSort (7.9%) and HubCluster (11.6%). Gorder, which comprehensively analyzes graph structure, yields an 18.6% average speed-up, marginally higher than that of DBG. We next analyze performance variations across datasets and applications.
Unstructured vs Structured
As shown in Fig. 5.5(a), on unstructured datasets, all reordering techniques provide positive speed-ups for all applications except PRD. Sec. 5.5.3 explains the reasons for the slowdowns on the PRD application. Among skew-aware techniques, DBG provides the highest average speed-up of 28.1%, in comparison to 22.1% for Sort, 19.8% for HubSort and 18.3% for HubCluster.

On the synthetic dataset kr, all techniques except HubCluster provide similar speed-ups, as kr is largely insensitive to structure preservation. Similarly, on the other unstructured datasets, as hot vertices are relatively more scattered in memory (see Table 4.2 of Chapter 4), the benefit of vertex packing outweighs the potential slowdown due to structure disruption. Thus, Sort, despite completely decimating the original graph structure, outperforms HubSort and HubCluster on more than half of the datapoints. Meanwhile, DBG, which also preserves graph structure while reducing the cache footprint of hot vertices, provides higher performance than Sort on more than half of the datapoints.

Overall, DBG provides more than a 30% speed-up over the baseline on half of the datapoints. DBG outperforms or matches existing skew-aware techniques on nearly all datapoints. Over the best-performing prior skew-aware technique, DBG provides the highest performance improvements on the SSSP application, with a maximum speed-up of 18.0% on the tw dataset.

Structured datasets exhibit high spatio-temporal locality in their original ordering. Thus, any technique that does not preserve the graph structure is likely to yield only a marginal speed-up, if any. Among skew-aware techniques, DBG provides the highest average speed-up of 6.5%, in comparison to -3.7% for Sort, -2.8% for HubSort and 5.3% for HubCluster.

On structured datasets, performance gains from the reduction in the footprint of hot vertices are negated by the disruption in graph structure. Thus, Sort and HubSort, which preserve graph structure the least, cause slowdowns (up to 38.4%) on more than half of the datapoints. DBG, in contrast, successfully avoids slowdowns on almost all datapoints and causes a marginal slowdown (up to 4.9%) on only 4 datapoints.

DBG vs Gorder
Gorder comprehensively analyzes vertex connectivity to improve cache locality, whereas DBG reorders vertices solely based on their degrees. Thus, it is expected that Gorder outperforms DBG (and the other skew-aware techniques). On average, Gorder yields a speed-up of 31.5% (vs 28.1% for DBG) on unstructured datasets and 6.9% (vs 6.5% for DBG) on structured datasets.

Specifically, the difference in speed-ups for DBG and Gorder is very small for the datasets kr, tw, wl and mp. These datasets have a relatively small clustering coefficient compared to the other datasets [9], which makes it difficult for Gorder to approximate a suitable vertex ordering. On the other datasets, Gorder provides significantly higher speed-ups than any skew-aware technique. Problematically, Gorder incurs a staggering reordering overhead, and thus causes severe slowdowns when accounted for its reordering time (see Sec. 5.5.4), making it impractical.
Figure 5.6: Effect of reordering techniques on graph datasets having no skew.
Reordering on No-Skew Graphs
In this section, we evaluate the effect of reordering techniques on graph datasets that have no skew. Skew-aware techniques are not expected to provide significant speed-ups on these datasets due to the lack of skew in their degree distribution. More importantly, these techniques are also not expected to cause any significant slowdown, due to a nearly complete lack of locality in the baseline ordering to begin with.

Fig. 5.6 shows speed-ups for reordering techniques on the two datasets, uni and road, listed in Table 5.8. As expected, all skew-aware techniques have a relatively neutral effect, with an average change in execution time within 1.2% on the uni dataset and within 0.4% on the road dataset. Meanwhile, Gorder yields slightly more speed-up (3.5% on both the uni and road datasets), as it can exploit fine-grain spatio-temporal locality, which is not entirely skew dependent.
In this section, we explain the sources of performance variation for different reordering techniques by analyzing their effects on all three levels of the cache hierarchy. Fig. 5.7 plots
Misses Per Kilo Instructions (MPKI) for the L1, L2 and L3 caches, measured using hardware performance counters, for the PR application as a representative example.

In the baseline with the original ordering, on all datasets except lj and wl, L1 MPKI is more than 100 (i.e., at least 1 L1 miss for every 10 instructions on average), which confirms the memory-intensive nature of graph applications. For the original ordering, L2 MPKI is only marginally lower than L1 MPKI across datasets, which shows that almost all memory accesses that miss in the L1 cache also miss in the L2 cache. As the L3 cache is significantly larger than the L2 cache, L3 MPKI is much lower than L2 MPKI; nonetheless, L3 MPKI is very high for the original ordering, ranging from 56.2 to 82.9 across the large datasets (excluding lj and wl).
Figure 5.7: Misses Per Kilo Instructions (MPKI) for the PR application across datasets: (a) L1 MPKI, (b) L2 MPKI, (c) L3 MPKI. Lower is better.

While all skew-aware techniques target the L3 cache, we observe that analyzing the effect of reordering on all three cache levels is necessary to understand application performance. For example, for the wl dataset, Sort yields a 5.5% reduction in L3 MPKI over the baseline and yet causes a slowdown of 5.1%. In fact, the slowdown is caused by 15.3% and 19.6% increases in L1 and L2 MPKI, respectively, over the baseline.

All skew-aware techniques are generally effective in reducing L3 MPKI on all datasets but lj. On unstructured datasets (the left-most four datasets), all skew-aware techniques reduce L1 and L2 MPKI, with the highest reduction on the sd dataset. Meanwhile, on structured datasets (the right-most four datasets), Sort and HubSort, which do not preserve graph structure, significantly increase L1 and L2 MPKI (increases of 5.7 to 27.6 over the original ordering). In contrast, HubCluster and DBG, which largely preserve the existing structure, only marginally change L1 and L2 MPKI (differences of -2.0 to 7.5) on structured datasets.
As seen in Fig. 5.5, all reordering techniques slow down the PRD application on many datasets, which can be attributed to the push-based computation model employed by PRD. In push-based computations, when a vertex pushes an update through its out-edges, it generates scattered, irregular write accesses (as opposed to the irregular read accesses of pull-based computations). As different threads may concurrently update the same vertex (true sharing) or update different vertices in the same cache block (false sharing), the push-based model leads to read-write or write-write sharing, and hence generates on-chip coherence traffic.

Fig. 5.8 quantifies coherence traffic for both push-dominated applications, SSSP and PRD.
Figure 5.8: Break-up of L2 misses for the push-dominated applications (SSSP and PRD) for datasets with original and DBG ordering, normalized to the L2 misses of the original ordering.

The figure shows the break-up of L2 misses into four categories: L3 hits (served by the L3 without requiring any snoops to other cores), snoops to other cores within the same socket, snoops to the other socket, and off-chip accesses. For the first three categories, data is served by an on-chip cache, whereas for the last category, data is served from main memory.

The two push-dominated applications have strikingly different fractions of coherence traffic while processing the datasets with the original ordering (middle two stacked bars in Fig. 5.8(a)). For SSSP, a relatively small fraction of L2 misses (14.5% for lj and below 9% for the other datasets) required snoops, whereas for PRD, a considerable fraction of L2 misses (from 26.9% for fr to 69.4% for wl) required snoops.

While processing a vertex using push-based computations, an application pushes updates (writes) to some or all destination vertices of the out-edges. In the case of PRD, it unconditionally pushes an update (i.e., a PageRank score) to all destination vertices while processing a vertex. In contrast, SSSP pushes an update to an out-edge only if it finds a shorter path through that edge. Thus, SSSP has far fewer irregular writes, and in turn, less coherence traffic, in comparison to PRD.

Fig. 5.8(b) shows a similar break-up for SSSP and PRD on the datasets after DBG reordering. For PRD, DBG consistently reduces off-chip accesses (top stacked bar) across datasets; thus, a significantly higher fraction of requests are served by on-chip caches. However, most of these requests (37.8% to 77.0% of L2 misses) incur a snoop latency. For example, for DBG, while processing the pl dataset, 65.4% (vs 49.2% for the original ordering) of L2 misses are served by on-chip caches (bottom three stacked bars combined). However, most of these on-chip hits required snooping other cores, incurring a high access latency. Specifically, only 18.9% (vs 14.8% for the original ordering) of total L2 misses are served without requiring snooping.
For most of the datasets, the increase in L3 hits (i.e., no snooping) due to DBG is relatively small despite a significant reduction in off-chip accesses, which explains the marginal speed-up of DBG for the PRD application (Fig. 5.5).

For SSSP, most of the savings in off-chip accesses directly translate into L3 hits (i.e., no snooping), as the application does not exhibit a high amount of coherence traffic even in the baseline. Thus, DBG is highly effective on SSSP, despite SSSP being dominated by push-based computations.

Figure 5.9: Net speed-up for software reordering techniques over the baseline with the original ordering of datasets. GMean shows the geometric mean of speed-ups across all five applications on four datasets.
Fig. 5.9 shows the end-to-end application speed-up for different reordering techniques after accounting for the reordering time. Without loss of generality, we show four datasets (the two largest unstructured and the two largest structured datasets).

Gorder, while more effective at improving application speed-up (Fig. 5.5), causes severe slowdowns (up to 96.5%) across datasets when accounted for its reordering time, corroborating prior work [6]. In contrast, all skew-aware techniques provide a net speed-up on at least some datapoints.

DBG outperforms all prior techniques on 17 out of 20 datapoints. DBG provides a net speed-up (up to 31.4%) on 14 out of 20 datapoints, even after accounting for its reordering time. On the remaining 6 datapoints, DBG reduces the slowdown when compared to prior techniques, with a maximum slowdown of 15.6% for the Radii application on the mp dataset and below 10% for the others. In contrast, existing skew-aware techniques cause slowdowns of up to 40.2% on half of the datapoints. Overall, DBG is the only technique that yields an average net speed-up (6.2%), by providing high performance while incurring low reordering overhead.

We next study how long it takes to amortize the reordering cost for an iterative
application (PR) and a root-dependent traversal application (SSSP).

Table 5.10: Minimum number of iterations needed for the PR application to amortize the reordering time of different reordering techniques (Sort, HubSort, HubCluster, DBG and Gorder) on the tw, sd, fr and mp datasets.
The PR application has the largest runtime among all five applications for any given dataset; thus, all skew-aware techniques are highly effective for the PR application and yield a net speed-up on all four datasets. Averaging across the four datasets for the PR application, DBG outperforms all reordering techniques with a 21.2% speed-up, vs 15.1% for Sort, 16.3% for HubSort, 11.6% for HubCluster and -41.3% for Gorder.

Table 5.10 lists the minimum number of iterations needed for the PR application to amortize the cost of different reordering techniques. For all four datasets, DBG is the quickest to amortize its reordering time, providing a net speed-up on all four datasets after just 2-5 iterations.
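The break-even points in Table 5.10 can be reasoned about with a simple model (our own back-of-the-envelope formulation, not one stated in the thesis). Let T_r denote the one-time reordering cost, and let t_orig and t_reord denote the per-iteration runtime on the original and the reordered graph, respectively. Reordering yields a net speed-up once

    n_{\min} = \left\lceil \frac{T_r}{t_{\mathrm{orig}} - t_{\mathrm{reord}}} \right\rceil

iterations have completed, provided t_reord < t_orig. A technique with a small T_r, such as DBG, therefore breaks even within a few iterations even when its per-iteration gain is modest, which is consistent with the 2-5 iterations reported above.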
We now evaluate the sensitivity of net performance to the number of successive graph traversals for the SSSP application. The runtime for root-dependent applications depends on the number of traversals (or queries) performed from different roots. The exact number of traversals required depends on the specific use case. Thus, we perform a sensitivity analysis by varying the number of traversals from 1 to 32 in multiples of 8.

As shown in Fig. 5.10, as the number of traversals increases, the performance of each technique also increases, as the reordering needs to be applied only once and its cost is amortized over multiple graph traversals. Thus, a single traversal is the worst-case scenario, with all techniques causing slowdowns due to their inability to amortize the reordering cost. Of all the techniques, DBG causes the minimum slowdown (20.6% on average vs 27.7% for the next best) and is the quickest to amortize the reordering cost, providing an average speed-up of 11.5% (vs 2.1% for the next best) with as few as 8 graph traversals.
Figure 5.10: Net speed-up for reordering techniques over the baseline with no reordering for SSSP with different numbers of traversals.
A significant amount of research has focused on designing high-performance software frameworks for graph applications (e.g., [42, 48, 55, 57, 62, 75]). In this section, we highlight the most relevant works that focus on improving cache efficiency for graph applications.
Graph slicing:
Researchers have proposed graph slicing, which slices the graph into LLC-sized partitions and processes one partition at a time to nullify the effect of irregular memory accesses [28, 34, 48]. While generally effective, slicing has two important limitations. First, it requires invasive framework changes to form the slices (which may include replicating vertices to avoid out-of-slice accesses) and to manage them at runtime. Second, for a given cache size, the number of slices increases with the size of the graph, resulting in greater processing overheads in creating and maintaining partitions for larger graphs. In comparison, DBG only requires a preprocessing pass over the graph dataset to relabel vertex IDs and does not require any change to the graph algorithms.
Traversal scheduling:
Mukkara et al. proposed Bounded Depth-First Scheduling (BDFS) to exploit cache locality for graphs exhibiting community structure [9]. Problematically, the software implementation of BDFS introduces significant book-keeping overheads, causing slowdowns despite improving cache efficiency. To avoid the software overheads, the authors propose an accelerator that implements BDFS scheduling in hardware. In comparison, DBG is a software technique that can improve application performance without any additional hardware support.
In this chapter, we studied existing skew-aware reordering techniques that seek to improve cache efficiency for graph analytics by reducing the cache footprint of hot vertices. We demonstrated the inherent tension between reducing the cache footprint of hot vertices and preserving the original graph structure, which limits the effectiveness of existing skew-aware reordering techniques. In response, we proposed Degree-Based Grouping (DBG), a lightweight vertex reordering software technique that employs coarse-grain reordering to preserve graph structure while reducing the cache footprint of hot vertices. On a variety of graph applications and datasets, DBG achieves higher average performance than all existing skew-aware techniques and nearly matches the average performance of the state-of-the-art complex reordering technique.

Chapter 6
GRASP – Domain-Specialized Cache Management
Almost all prior works on hardware cache management targeting cache thrashing are domain-agnostic [8, 18, 37, 39, 40, 54, 59, 63, 67, 69, 71, 73, 76, 78, 80, 81, 82, 85, 86, 87, 88, 89, 97, 103, 110]. These hardware techniques aim to perform two tasks: (1) identify cache blocks that are likely to exhibit high reuse, and (2) protect high-reuse cache blocks from cache thrashing. To accomplish the first task, these techniques deploy either probabilistic or prediction-based hardware mechanisms [8, 37, 39, 40, 67, 71, 73, 81, 86]. However, as we showed in Chapter 4, graph-dependent irregular access patterns, combined with the long reuse distance of accesses, prevent these techniques from correctly learning which cache blocks to preserve, rendering them deficient for the broad domain of graph analytics. Meanwhile, to accomplish the second task, recent work proposed pinning of high-reuse cache blocks in the LLC to ensure that these blocks are not evicted [10]. However, we find that pinning-based techniques are overly rigid and result in sub-optimal utilization of cache capacity.

To overcome the limitations of existing hardware cache management techniques, we propose
GRASP – GRAph-SPecialized cache management at the LLC. To the best of our knowledge, this is the first work to introduce domain-specialized cache management for the domain of graph analytics. GRASP augments existing cache insertion and hit-promotion policies to provide preferential treatment to the cache blocks containing hot vertices, to shield them from thrashing. To cater to the irregular access patterns, GRASP policies are designed to be flexible enough to cache other blocks
exhibiting reuse. By not relying on pinning, GRASP maximizes cache efficiency based on the observed access patterns.

GRASP relies on lightweight software support to accurately pinpoint hot vertices amidst irregular access patterns, in contrast to history-based predictive techniques that rely on storage-intensive hardware mechanisms. By leveraging vertex reordering techniques such as DBG, GRASP enables a lightweight software-hardware interface comprising only a few configurable registers, which are programmed by software using its knowledge of the graph data structures.

GRASP requires minimal changes to the existing microarchitecture, as GRASP only augments existing cache policies and its interface is lightweight. GRASP does not require additional metadata in the LLC or storage-intensive prediction tables. Thus, GRASP can easily be integrated into commodity server processors, enabling domain-specific acceleration for graph analytics at minimal hardware cost.

To summarize, our contributions are as follows:

• We qualitatively and quantitatively show that a wide range of prior domain-agnostic hardware cache management techniques, despite their sophisticated prediction mechanisms, are inefficient for the domain of graph analytics.

• We propose
GRASP, graph-specialized LLC management for graph analytics on natural graphs. GRASP augments existing cache policies to protect hot vertices against thrashing while also maintaining flexibility to capture reuse in other cache blocks. GRASP employs a lightweight software interface to pinpoint hot vertices amidst irregular accesses, which eliminates the need for metadata storage at the LLC, keeping the existing cache structure largely unchanged.

• Our evaluation on several multi-threaded graph applications operating on large, high-skew datasets shows that GRASP outperforms state-of-the-art domain-agnostic techniques on all datapoints, yielding an average speed-up of 4.2% (max 9.4%) over the best-performing prior technique. GRASP is also robust on low-/no-skew datasets, whereas prior techniques consistently cause a slowdown.
This chapter introduces GRASP, graph-specialized cache management at the LLC for graph analytics processing natural graphs. GRASP augments existing cache management
techniques with simple modifications to their insertion and hit-promotion policies that provide preferential treatment to the cache blocks containing hot vertices to protect them from thrashing. GRASP policies are sufficiently flexible to capture the reuse of other blocks as needed.

GRASP's domain-specialized design is influenced by the following two challenges faced by existing hardware cache management techniques. First, hardware alone cannot enforce spatial locality, which is dictated by vertex placement in the memory space and is under software control. Second, domain-agnostic hardware cache management techniques struggle to pinpoint hot vertices under cache thrashing due to the irregular access patterns endemic to graph analytics.

To overcome both challenges, GRASP relies on skew-aware reordering techniques to induce spatial locality by segregating hot vertices in a contiguous memory region. While these techniques offer different trade-offs in terms of reordering cost and their ability to preserve graph structure, they all work by isolating hot vertices from the cold ones. Fig. 6.1(a) shows a logical view of the placement of hot vertices in the Property Array after reordering by such a technique. GRASP subsequently leverages the contiguity among hot vertices in the memory space to (1) pinpoint them via a lightweight interface and (2) protect them from thrashing. The GRASP design consists of three hardware components, as follows.

Figure 6.1: GRASP overview. (a) Software applies vertex reordering, which segregates hot vertices at the beginning of the array. (b) The GRASP interface exposes an ABR pair per Property Array to be configured with the bounds of the array. (c) GRASP identifies regions exhibiting different reuse based on the LLC size.
A Software-hardware interface:
The GRASP interface is minimal, consisting of a few
configurable registers that software populates with the bounds of the Property Array during the initialization of an application (see Fig. 6.1(b)). Once populated, GRASP does not rely on any further intervention from software.

Figure 6.2: Block diagram of GRASP and the other hardware components with which it interacts. GRASP components are shown in color. For brevity, the figure shows only one CPU core.
B Classification logic:
GRASP logically partitions the Property Array into different regions based on expected reuse (see Fig. 6.1(c)). GRASP implements simple comparison-based logic which, at runtime, checks whether a cache request belongs to any one of these regions.
C Specialized cache policies:
GRASP specializes the cache policies for each region to ensure hot vertices are protected from thrashing while retaining flexibility in caching other blocks. The classification logic guides which policy to apply to a given cache block.

Fig. 6.2 shows how GRASP interacts with other hardware components in the system. In the following sections, we describe each of GRASP's components in detail.
GRASP’s interface consists of one pair of
Address Bound Registers (ABR) per Property Array; recall from Sec. 4.2 of Chapter 4 that an application may maintain more than one Property Array, each of which requires a dedicated ABR pair. ABRs are part of the application context and are exposed to the software. At application start-up, the graph framework populates each ABR pair with the start and end virtual addresses of the entire Property Array (Fig. 6.1(b)). Setting these registers activates the custom cache management for graph analytics. When the ABRs are not set by the software (i.e., the default case for other applications), the specialized cache management is essentially disabled.

The use of virtual addresses keeps the GRASP interface independent of the existing TLB design, allowing GRASP to perform address classification (described next) in parallel with the usual virtual-to-physical address translation carried out by the TLB (see Fig. 6.2). Prior works have used similar approaches to pass data-structure bounds to aid microarchitectural mechanisms [10, 29, 38, 52].
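As an illustration of how a framework might program the ABRs at start-up, the sketch below wraps the interface in a hypothetical helper; set_abr_pair and write_abr_register are placeholders of ours, since the thesis specifies only that the start and end virtual addresses of each Property Array are written to an ABR pair.

    #include <cstddef>
    #include <cstdint>

    // Hypothetical wrapper around the GRASP ABR interface. How the registers are
    // actually written (e.g., a dedicated instruction or an MSR write) is not
    // modeled here; write_abr_register() is a placeholder for that mechanism.
    void write_abr_register(int abr_id, uintptr_t start_va, uintptr_t end_va);

    // Register one Property Array with GRASP at application start-up.
    template <typename T>
    void set_abr_pair(int abr_id, const T* property_array, std::size_t num_vertices) {
        uintptr_t start = reinterpret_cast<uintptr_t>(property_array);
        uintptr_t end   = start + num_vertices * sizeof(T);   // exclusive end address
        write_abr_register(abr_id, start, end);               // activates GRASP policies
    }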
This component of GRASP is responsible for reliably identifying cache blocks containing hot vertices in hardware by leveraging the bounds of the Property Array(s) available in the ABRs, as explained in the following sections:
Identifying hot vertices:
In theory, all hot vertices should be cached. In practice, it is unlikely that all hot vertices will fit in the LLC for large datasets, as shown in Table 5.1 of Chapter 5. In such a case, providing preferential treatment to all hot vertices is not beneficial, as they can thrash each other in the LLC. To avoid this problem, GRASP prioritizes cache blocks containing only a subset of hot vertices, comprised of only the hottest vertices based on the available LLC capacity. Conveniently, the hottest vertices are located at the beginning of the Property Array in a contiguous region, thanks to the application of skew-aware reordering, as shown in Fig. 6.1(a).
Pinpointing the High Reuse Region:
GRASP labels two LLC-sized sub-regions within the Property Array: the LLC-sized memory region at the start of the Property Array is labeled the High Reuse Region; another LLC-sized memory region starting immediately after the High Reuse Region is labeled the Moderate Reuse Region (Fig. 6.1(c)). Finally, if an application specifies more than one Property Array, GRASP divides the LLC size by the number of Property Arrays before labeling the regions.
Classifying LLC accesses:
At runtime, GRASP classifies a memory address making an LLC access as High-Reuse if the address belongs to the High Reuse Region of any Property Array; GRASP determines this by comparing the address with the bounds of the High Reuse Region of each Property Array. Similarly, an address is classified as Moderate-Reuse if the address belongs to the Moderate Reuse Region. All other LLC accesses are classified as Low-Reuse. For non-graph applications, the ABRs are not initialized and all accesses are classified as Default, effectively disabling domain-specialized cache management. GRASP encodes the classification result (High-Reuse, Moderate-Reuse, Low-Reuse or Default) as a 2-bit Reuse Hint and forwards it to the LLC along with each cache request, as shown in Fig. 6.2, to guide the specialized insertion and hit-promotion policies described next.
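A minimal software model of this classification step is sketched below (our illustration; the hardware performs the same comparisons with simple comparators). The region bounds are derived from the ABR pair and the LLC size divided by the number of Property Arrays, as described above; names such as ReuseHint and classify_access are ours.

    #include <algorithm>
    #include <cstddef>
    #include <cstdint>
    #include <vector>

    enum class ReuseHint : uint8_t { Default = 0, HighReuse, ModerateReuse, LowReuse };

    // Per-Property-Array region bounds derived from one ABR pair.
    struct RegionBounds {
        uintptr_t high_start, high_end;   // High Reuse Region:     [start, start + share)
        uintptr_t mod_start,  mod_end;    // Moderate Reuse Region: [high_end, high_end + share)
    };

    // Label the two LLC-sized sub-regions of one Property Array. 'llc_share' is
    // the LLC size divided by the number of Property Arrays.
    RegionBounds label_regions(uintptr_t abr_start, uintptr_t abr_end, std::size_t llc_share) {
        RegionBounds r;
        r.high_start = abr_start;
        r.high_end   = std::min<uintptr_t>(abr_start + llc_share, abr_end);
        r.mod_start  = r.high_end;
        r.mod_end    = std::min<uintptr_t>(r.high_end + llc_share, abr_end);
        return r;
    }

    // Classify the virtual address of an LLC access into a 2-bit Reuse Hint.
    ReuseHint classify_access(uintptr_t va, const std::vector<RegionBounds>& regions) {
        if (regions.empty()) return ReuseHint::Default;   // ABRs not set (non-graph app)
        for (const RegionBounds& r : regions) {
            if (va >= r.high_start && va < r.high_end) return ReuseHint::HighReuse;
            if (va >= r.mod_start  && va < r.mod_end)  return ReuseHint::ModerateReuse;
        }
        return ReuseHint::LowReuse;
    }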
This component of GRASP implements the specialized cache policies that protect the cache blocks associated with High-Reuse LLC accesses against thrashing. One naive way of doing so is to pin the High-Reuse cache blocks in the LLC. However, pinning would sacrifice any opportunity to exploit temporal reuse that may be exposed by other cache blocks (e.g., Moderate-Reuse cache blocks).

To overcome this challenge, GRASP adopts a flexible approach by augmenting an existing cache replacement policy with a specialized insertion policy for LLC misses and a hit-promotion policy for LLC hits. GRASP's specialized policies provide preferential treatment to High-Reuse blocks while maintaining flexibility in exploiting temporal reuse in other cache blocks, as discussed next.
Insertion policy:
Accesses tagged as High-Reuse, comprising the set of the hottest vertices belonging to the High Reuse Region, are inserted into the cache at the MRU position to protect them from thrashing. Accesses tagged as Moderate-Reuse, which likely exhibit lower reuse than those in the High Reuse Region, are inserted near the LRU position. Such an insertion policy gives Moderate-Reuse cache blocks an opportunity to experience a hit without causing thrashing. Finally, accesses tagged as Low-Reuse, comprising the rest of the graph dataset, including the long tail of the Property Array containing cold vertices, are inserted at the LRU position, thus making them immediate candidates for replacement while still providing them with an opportunity to experience a hit and be promoted using the specialized policy described next.
Hit-promotion policy:
Cache blocks associated with High-Reuse LLC accesses are immediately promoted to the MRU position on a hit to protect them from thrashing. LLC hits to blocks classified as Moderate-Reuse or Low-Reuse make for an interesting case. On the one hand, the likelihood of these blocks having further reuse is quite limited, which means they should not be promoted directly to the MRU position. On the other hand, by experiencing at least one hit, these blocks have demonstrated temporal locality, which cannot be completely ignored. GRASP takes a middle ground for such blocks by gradually promoting them towards the MRU position on every hit.
Eviction policy:
GRASP’s eviction policy does not differentiate among blocks atreplacement time; hence, it is unmodified from the baseline technique. This is a .2. GRASP: Caching In on the Skew Reuse Hint Insertion Policy Hit PolicyHigh-Reuse RRPV = 0 RRPV = 0Moderate-Reuse RRPV = 6 if RRPV > Table 6.1: Policy columns show how GRASP updates per-block 3-bit RRPV counter ofRRIP (base technique) for a given Reuse Hint. Higher RRPV value indicates highereviction priority. key factor that keeps the cache management flexible for GRASP. By not prioritizingcandidates for eviction, GRASP ensures that blocks classified as High-Reuse but notreferenced for a long time can yield cache space to other blocks that do exhibit reuse.Because the unchanged eviction policy does not need to differentiate between blockswith High-Reuse and other hints, cache blocks do not need to explicitly store the ReuseHint as additional LLC metadata.Table 6.1 shows the specialized cache policies for all Reuse Hints under GRASP.While the table, and our evaluation, assumes RRIP [71] as the base replacementtechnique, we note that GRASP is not fundamentally dependent on RRIP and can beimplemented over many other techniques including, but not limited to, LRU, Pseudo-LRU and DIP [86].
The state-of-the-art history-based predictive techniques [8, 37, 39, 40, 67, 73] require intrusive modifications to the cache structure in the form of metadata embedded in cache blocks and/or dedicated predictor tables. These techniques also require propagating a PC signature through the core pipeline all the way to the LLC, which has so far hindered their commercial adoption.

In comparison, GRASP is implemented within the same hardware structure required by the base technique (e.g., RRIP). GRASP propagates only a 2-bit Reuse Hint to the LLC on each cache access to guide cache policy decisions. By relying on lightweight software support, GRASP reliably pinpoints hot vertices in hardware without requiring costly prediction tables and/or additional per-cache-block metadata. When compared to pinning-based techniques, GRASP policies protect hot vertices from thrashing while remaining flexible enough to capture the reuse of other blocks as needed.
Dataset                 Vertex Count   Edge Count   Avg. Degree
LiveJournal (lj) [51]   5M             68M          14
PLD (pl) [7]            43M            623M         15
Twitter (tw) [74]       62M            1,468M       24
Kron (kr) [42]          67M            1,323M       20
SD (sd) [7]             95M            1,937M       20
Friendster (fr) [33]    64M            2,147M       33
Uniform (uni) [92]      50M            1,000M       20

Table 6.2: Properties of the graph datasets. The top five datasets are used in the main evaluation, whereas the bottom two datasets are used as adversarial datasets.
Combining robust cache policies with minimal hardware modifications makes GRASP feasible for commercial adoption while also providing higher LLC efficiency.
For the evaluation, we use the same set of applications as we did in Chapter 5 (see Table 5.5). We combine these five applications – BC, SSSP, PR, PRD and Radii – with the five high-skew graph datasets listed in Table 6.2, resulting in 25 benchmarks. To test the robustness of GRASP against adversarial workloads, we use two additional datasets with low-/no-skew.

We obtained the source code for the graph applications from Ligra [57] and applied a simple data-structure optimization to improve locality in the baseline implementation, as follows. As explained in Sec. 4.3 of Chapter 4, graph applications exhibit irregular accesses to the Property Array, with applications potentially maintaining more than one such array. When multiple Property Arrays are used, the elements corresponding to a given vertex may need to be sourced from all of the arrays. We merge these arrays (i.e., a Structure-of-Arrays to Array-of-Structures transformation) to induce spatial locality, which reduces the number of misses, and in turn, improves performance on all datasets for PR, PRD and SSSP (see Table 6.3). We use the optimized implementations of these three applications as a stronger baseline for our evaluation. The optimized applications are available at https://github.com/faldupriyank/grasp. We do note that
Application   Merging Opportunity?   Speed-up
BC            No                     -
SSSP          Yes                    3-8%
PR            Yes                    40-52%
PRD           Yes                    14-49%
Radii         No                     -
Table 6.3: Effect of our optimization on the original Ligra implementation of different applications. PR applies pull-based computations whereas SSSP applies push-based computations throughout the execution; the rest of the applications switch between pull and push based on the number of active vertices in a given iteration.
GRASP does not mandate merging arrays, as the GRASP design can accommodate multiple arrays. Nevertheless, merging does reduce the number of arrays that need to be tracked. For PRD, two versions of the algorithm are provided with Ligra: push-based and pull-push. In the baseline implementation, the push-based version is faster. However, after merging the Property Arrays, the pull-push variant performs better, and is what we use for the evaluation.
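To illustrate the data-structure optimization, the sketch below shows the Structure-of-Arrays to Array-of-Structures transformation for a PageRank-style pair of per-vertex properties. The field and type names are our own illustration, not Ligra's.

    #include <cstddef>
    #include <vector>

    // Structure of Arrays (SoA): each per-vertex property has its own array, so an
    // irregular access to vertex v touches a different cache line per property.
    struct PropertiesSoA {
        std::vector<double> rank;    // e.g., current PageRank score
        std::vector<double> delta;   // e.g., accumulated change since the last update
    };

    // Array of Structures (AoS): the properties of a vertex are adjacent in memory,
    // so one irregular access brings all of them into the same cache line.
    struct VertexProperty {
        double rank;
        double delta;
    };
    using PropertiesAoS = std::vector<VertexProperty>;

    // One-time merge from the SoA layout to the AoS layout, done before processing.
    PropertiesAoS merge(const PropertiesSoA& soa) {
        PropertiesAoS aos(soa.rank.size());
        for (std::size_t v = 0; v < aos.size(); ++v)
            aos[v] = {soa.rank[v], soa.delta[v]};
        return aos;
    }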
The methodology for the evaluation of the software reordering techniques – Sort, Hub Sorting, DBG and Gorder – is identical to the methodology used in the previous chapter (see Sec. 5.4 of Chapter 5).
Simulation infrastructure:
We use the
Sniper [50] simulator modeling 8 OoO cores. Table 6.4 lists the parameters of the simulated system. The applications are evaluated in a multi-threaded mode with 8 threads.

We find that the graph applications spend a significant fraction (86% on average in our evaluations) of their time in push-based iterations for SSSP, or in pull-based iterations for all other evaluated applications. Thus, we simulate the
Region of Interest (ROI) covering only push- or pull-based iterations (whichever dominates) for the respective applications. Because simulating all iterations of a graph-analytic application in a detailed microarchitectural simulator is prohibitive, time-wise, we instead simulate one iteration that has the highest number of active vertices. To validate the soundness of our methodology, we also simulated one more randomly chosen iteration for each application-dataset pair with at least 20% of vertices active, and observed trends similar to the ones reported in the paper.

Core           OoO @ 2.66GHz, 4-wide front-end
L1-I/D Cache   4/8-way 32KB, 4-cycle access latency, stride-based prefetchers with 16 streams
L2 Cache       Unified, 8-way 256KB, 6-cycle access latency
L3 Cache       16-way 16MB NUCA (2MB slice per core), Non-Inclusive Non-Exclusive, 10-cycle bank access latency
NOC            Ring network with 2 cycles per hop
Memory         50ns latency, 2 on-chip memory controllers

Table 6.4: Parameters of the simulated system for the evaluation of the hardware techniques.
Evaluated cache management techniques:
We evaluate GRASP and compare it with the state-of-the-art thrash-resistant cache management techniques described below.
RRIP [71] is the state-of-the-art technique among static and lightweight dynamic techniques that do not depend on history-based learning. RRIP is the most appropriate comparison point, given that GRASP builds upon RRIP as the base technique (Sec. 6.2.3). We implement RRIP (specifically,
DRRIP) based on the source code for RRIP from the cache replacement championship [68], and use a 3-bit counter per cache block. We use RRIP as a high-performance baseline and report speed-ups for all hardware techniques over the RRIP baseline (except for the studies in Sec. 6.4.4, which use an LRU baseline).
Signature-based Hit Predictor (SHiP) [67] is the state-of-the-art insertion policy, which builds on RRIP [71]. Due to the shortcomings of PC-based correlation for graph applications, as explained in Sec. 4.6 of Chapter 4, we evaluate the SHiP-MEM variant, which correlates a block's reuse with the block's memory region. We use 16KB memory regions, as in the original proposal. The predictor table is provisioned with an unlimited number of entries to assess the maximum potential of this technique. Every entry in the predictor table contains a 3-bit saturating counter that tracks the re-reference behavior of cache blocks in the memory region associated with that entry.
Hawkeye [37] is the state-of-the-art cache management technique and the winner of the recent cache replacement championship (CRC2) [20]. Hawkeye trains its predictor table by simulating OPT [114] on past LLC accesses to infer a block's cache friendliness. We use an improved, prefetch-aware version of Hawkeye from CRC2 (i.e., Hawkeye++ from Sec. 3.6 of Chapter 3). We appropriately scale the number of sampling sets and predictor table entries for a 16MB cache.
Leeway (specifically, Leeway-NRU from Chapter 3) is a history-based predictive cache management technique that applies dead-block predictions based on a metric called Live Distance, which conservatively captures the reuse interval of a cache block. We appropriately scale the number of sampling sets and predictor table entries for a 16MB cache.
XMem [10] is a pinning-based technique proposed for algorithms that benefit from cache tiling. Once pinned, a cache block cannot be evicted until it is explicitly unpinned by the software, which is usually done when the processing of a tile is complete. In the original proposal, XMem reserves 75% of the LLC capacity to pin tile data, whereas the remaining capacity is managed by the base replacement technique for the rest of the data. In this work, we explore four configurations of XMem, labeled PIN-X, where X refers to the percentage (25%, 50%, 75% or 100%) of LLC capacity reserved for pinning. We adapt the XMem design for graph analytics and identify the cache blocks from the High Reuse Region that benefit from pinning using the GRASP interface. Finally, XMem requires an additional bit for every cache block to identify whether the block is pinned, along with an additional mechanism to track how much of the capacity is used by pinned cache blocks at any given time.
GRASP is the proposed domain-specialized cache management technique for graph analytics. We instrument the applications to communicate the address bounds of the Property Arrays to the simulated GRASP hardware. For the evaluated applications, we needed to instrument at most two arrays. Finally, GRASP uses RRIP as the base cache policy with a 3-bit saturating counter and does not add any further storage to the per-block metadata.
We first evaluate hardware cache management techniques on top of a software skew-aware reordering technique (Sec. 6.4.1 & 6.4.2). Due to long simulation times, evaluating all hardware techniques on top of all reordering techniques would be prohibitive. Thus, without loss of generality, we evaluate hardware techniques on top of DBG, which consistently outperforms the other reordering techniques (Sec. 6.4.3.1). In Sec. 6.4.3.2, we evaluate GRASP with other reordering techniques to show GRASP's generality.
Figure 6.3: LLC miss reduction for GRASP and state-of-the-art history-based predictive techniques over the RRIP baseline.
Figure 6.4: Speed-up for GRASP and state-of-the-art history-based predictive cache management techniques over the RRIP baseline.
In this section, we compare GRASP with the state-of-the-art hardware techniques: SHiP-MEM [67], Hawkeye [37] and Leeway. As we showed in Chapter 4, RRIP consistently outperforms LRU across the datapoints, so we use RRIP as a stronger baseline. Finally, we use DBG as the software baseline; thus, all speed-ups reported in this section are over and above DBG.

Miss reduction:
Fig. 6.3 shows the miss reduction over the RRIP baseline. GRASP consistently reduces misses on all datapoints, eliminating 6.4% of LLC misses on average and up to 14.2% in the best case (on the lj dataset for the Radii application). The domain-specialized design allows GRASP to accurately identify the high-reuse working set (i.e., hot vertices), which GRASP is able to retain in the cache through its specialized policies, effectively exploiting the available temporal reuse.

Among prior techniques, Leeway is the only one that reduces misses, albeit marginally, with an average miss reduction of 1.1% over the RRIP baseline. The other two techniques are not effective for graph applications, with SHiP-MEM and Hawkeye increasing misses across the datapoints, for average miss reductions of -4.8% and -22.7%, respectively, over the baseline. This is a new result, as prior works show that Hawkeye and SHiP-MEM outperform RRIP on a wide range of applications [37, 67]. The result indicates that the learning mechanisms of the state-of-the-art domain-agnostic techniques are deficient in retaining the high-reuse working set (i.e., hot vertices) for graph applications, which ends up hurting application performance, as discussed next.

Application speed-up:
Fig. 6.4 shows the speed-up for hardware techniques over the RRIP baseline. Overall, performance correlates well with the change in LLC misses; GRASP consistently provides a speed-up across datapoints, with an average speed-up of 5.2% and up to 10.2% in the best case (on the pl dataset for the SSSP application) over the baseline. When compared to the same baseline, SHiP-MEM and Hawkeye consistently cause slowdown with an average speed-up of -5.5% and -16.2%, respectively, whereas Leeway yields a marginal speed-up of 0.9%. Finally, when compared to prior works directly, GRASP yields 4.2%, 5.2%, 11.2% and 25.5% average speed-up over Leeway, RRIP, SHiP-MEM and Hawkeye, respectively, while not causing slowdown on any datapoint.

Recall from Chapter 4 that we also evaluated prior techniques without applying any vertex reordering. As shown in Fig. 4.5, Leeway, SHiP-MEM and Hawkeye yield an average speed-up of -0.8%, -5.7% and -14.8%, respectively, over RRIP on the datasets with no reordering.

Dissecting performance of SHiP-MEM:
SHiP-MEM is a predictive technique that predicts reuse of a cache block based on the fine-grained memory region it belongs to. Thus, SHiP-MEM relies on homogeneous cache behavior for all blocks belonging to the same memory region. In theory, DBG should allow SHiP-MEM to identify the memory regions containing the hottest vertices (corresponding to the High Reuse Region from Fig. 6.1(c)). In practice, however, irregular access patterns to these regions and thrashing by cache blocks from other regions impede learning. Thus, despite leveraging software and utilizing a sophisticated, storage-intensive prediction mechanism in hardware, SHiP-MEM underperforms the domain-specialized GRASP.
Dissecting performance of Hawkeye:
Hawkeye is the state-of-the-art predictive technique that uses PC-based correlation to predict whether a cache block has cache-friendly or cache-averse behavior based on past LLC accesses. Thus, Hawkeye fundamentally relies on homogeneous cache behavior for all blocks accessed by the same PC address. When Hawkeye is employed for graph analytics, it struggles to learn the behavior of cache blocks in the Property Array: hot vertices exhibit cache-friendly behavior while cold vertices exhibit cache-averse behavior, yet all vertices are accessed by the same PC address. To make matters worse, if a block incurs a hit and Hawkeye predicts the PC making the access as cache-averse, the cache block is prioritized for eviction instead of being promoted to MRU as is done in the baseline. Thus, Hawkeye performs even worse than the baseline for all combinations of graph applications and datasets. While not evaluated, other prior techniques (e.g., [67, 73]) that rely on a PC-based correlation would also struggle on graph applications for the same reason.
Dissecting performance of Leeway:
Leeway, like Hawkeye, also relies on a PC-based reuse correlation, and thus is not expected to provide significant speed-ups for graph analytics. However, Leeway successfully avoids slowdown on 10 of the 25 datapoints and significantly limits the slowdown on the rest of the datapoints (max slowdown of 2.1% vs 13.6% for SHiP-MEM and 30.2% for Hawkeye). The reasons why Leeway performs better than prior PC-based techniques can be attributed to (1) the conservative nature of the Live Distance metric, which Leeway uses to determine if a cache block is dead, and (2) adaptive reuse-aware policies that control the rate of predictions based on the observed access patterns. Because of these two factors, the performance of Leeway remains close to that of the base replacement technique in the presence of variability in the reuse behavior of cache blocks.
Dissecting performance of GRASP:
Performance of GRASP over its base technique, RRIP, can be attributed to three features: software hints, insertion policy and hit-promotion policy. Fig. 6.5 shows the performance impact due to each of these features. RRIP inserts every new cache block at one of two positions (as specified in the Default Reuse Hint of Table 6.1); a cache block is inserted at the LRU position with a high probability or near the LRU position with a low probability. RRIP+Hints is identical to RRIP except for how a new cache block is assigned these positions. RRIP+Hints uses software hints (similar to GRASP) to guide the insertion: a cache block with a High-Reuse hint is inserted near the LRU position and all other blocks are inserted at the LRU position. GRASP (Insertion-Only) refers to the technique that applies the insertion policy of GRASP as specified in Table 6.1 but leaves the hit-promotion policy unchanged from RRIP. Finally, GRASP (Hit-Promotion) refers to the technique that applies the hit-promotion policy of GRASP along with its insertion policy, which is essentially the full GRASP design. Note that each successive technique adds a new feature on top of the features incorporated by the previous ones. For example, GRASP (Insertion-Only) features a new insertion policy in addition to the software hints.

Figure 6.5: Impact of GRASP features on performance.

As the figure shows, RRIP+Hints yields an average speed-up of 3.3% over probabilistic RRIP, confirming the utility of software hints. GRASP (Insertion-Only) further increases performance, yielding an average speed-up of 5.0%. GRASP (Insertion-Only) provides additional protection to the High-Reuse cache blocks in comparison to RRIP+Hints by inserting High-Reuse cache blocks directly at the MRU position. Finally, GRASP (Hit-Promotion) yields an average speed-up of 5.2%. The difference between GRASP (Hit-Promotion) and GRASP (Insertion-Only) is marginal, as the hit-promotion policy of GRASP has a negative effect on slightly less than half of the datapoints. The results are in line with observations from our prior work showing that the value-addition of hit-promotion policies over insertion policies is low in the presence of cache thrashing [32].
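The following sketch illustrates, in simplified form, how a 2-bit software Reuse Hint could steer RRIP-style insertion and hit promotion; the specific RRPV values chosen here are illustrative and do not reproduce the exact priorities of Table 6.1.

#include <cstdint>

// Simplified sketch of hint-guided RRIP insertion and hit promotion, in the
// spirit of GRASP, for a 3-bit RRPV (0 = MRU / most protected, 7 = LRU /
// evicted first). The positions below are illustrative choices.
enum class ReuseHint : uint8_t { NoReuse, ModerateReuse, HighReuse };

constexpr uint8_t kMaxRRPV = 7;

// Insertion: blocks carrying the High-Reuse hint (hot vertices) are inserted
// at MRU for maximum protection; everything else is inserted at or near LRU.
uint8_t insertion_rrpv(ReuseHint hint) {
    switch (hint) {
        case ReuseHint::HighReuse:     return 0;             // MRU
        case ReuseHint::ModerateReuse: return kMaxRRPV - 1;  // near LRU
        default:                       return kMaxRRPV;      // LRU
    }
}

// Hit promotion: High-Reuse blocks jump back to MRU; other blocks are only
// nudged toward MRU so that they cannot displace hot vertices for long.
uint8_t promotion_rrpv(ReuseHint hint, uint8_t current_rrpv) {
    if (hint == ReuseHint::HighReuse) return 0;
    return current_rrpv > 0 ? static_cast<uint8_t>(current_rrpv - 1) : 0;
}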
Summary:
Hardware cache management is an established, difficult problem, which is reflected in the small average speed-ups (usually 1%-5%) achieved by state-of-the-art techniques over the prior best techniques [8, 37, 39, 40, 67, 71, 73]. Our work shows that graph applications present a particularly challenging workload for these techniques, in many cases leading to significant performance slowdowns. In this light, GRASP is quite successful in improving the performance of graph applications, yielding an average speed-up of 5.2% (max 10.2%) over a high-performing software and hardware baseline, while not causing slowdown on any datapoint. Moreover, unlike state-of-the-art techniques, GRASP achieves this without requiring storage-intensive metadata.
In this section, we show the benefit of flexible GRASP policies over rigid pinning-based approaches. We first present the results on the high-skew datasets and then on the low-/no-skew datasets to test their resilience in adversarial scenarios.
High-skew datasets:
Fig. 6.6 shows speed-ups for the four XMem configurations (PIN-25, PIN-50, PIN-75 and PIN-100) and GRASP over the RRIP baseline on high-skew datasets. GRASP outperforms all XMem configurations on 24 of 25 datapoints with an average speed-up of 5.2%. In comparison, PIN-25, PIN-50, PIN-75 and PIN-100 yield 0.4%, 1.1%, 2.0% and 2.5%, respectively.

Figure 6.6: Speed-up for GRASP and pinning-based techniques over the RRIP baseline on high-skew datasets.

PIN-100 outperforms the other three XMem configurations because, for those configurations, a significant fraction of the capacity can still be occupied by cold vertices, which causes thrashing in the unreserved capacity. Nevertheless, PIN-100 causes slowdown on many datapoints (e.g., for the BC, PR and PRD applications on the tw and sd datasets). Moreover, PIN-100 cannot capitalize on reuse from the Moderate Reuse Region as pinned vertices cannot be evicted even when they stop exhibiting reuse. Thus, PIN-100 provides only a marginal speed-up on many datapoints (e.g., the Radii application on the lj, tw and kr datasets).

PIN-75 and PIN-100, the two highest performing XMem configurations, while yielding only marginal speed-ups, still outperform the state-of-the-art domain-agnostic techniques, SHiP-MEM, Leeway and Hawkeye (Figs. 6.4 & 6.6), which confirms that utilizing software knowledge for cache management is a promising direction over a storage-intensive domain-agnostic design for the challenging access patterns of graph analytics.

Low-/No-skew datasets:
Next, we evaluate the robustness of GRASP and the pinning-based techniques (PIN-75 and PIN-100) on adversarial datasets with low-/no-skew. Naturally, these techniques are not expected to provide a significant speed-up in the absence of high skew; however, a robust technique would reduce or avoid the slowdown. Fig. 6.7 shows the speed-up for a low-skew dataset fr and a no-skew dataset uni for these techniques over the RRIP baseline.

Figure 6.7: Speed-up over the RRIP baseline on fr, a low-skew dataset, and uni, a no-skew dataset.

GRASP provides a positive speed-up on 9 out of 10 datapoints even for low-/no-skew datasets. On the low-skew dataset fr, GRASP yields a speed-up between 0.4% and 4.3%, whereas on the no-skew dataset uni, GRASP yields a speed-up between -0.1% and 2.4%. In contrast, PIN-75 and PIN-100 cause slowdown on almost all datapoints.

In the absence of high skew, cache blocks belonging to the High Reuse Region do not dominate the overall LLC accesses. Thus, pinning these blocks throughout the execution is counter-productive for PIN-75 and PIN-100. In contrast, GRASP adopts a flexible approach, wherein the high priority cache blocks from the High Reuse Region can make way for other blocks that observe some reuse, as needed. Thus, GRASP successfully limits slowdown, and even provides reasonable speed-up on some datapoints, for such highly adversarial datasets.

Finally, combining results on all 7 datasets (5 datasets from Fig. 6.6 and 2 from Fig. 6.7), GRASP yields an average speed-up of 4.1%. In comparison, PIN-75 and PIN-100 provide a marginal speed-up of only 0.5% and 0.1%, respectively. PIN-75 and PIN-100 cause slowdowns of up to 5.3% and 14.2%, whereas the max slowdown for GRASP is only 0.1%.

Thus far, we evaluated GRASP on graph applications processing datasets that are reordered using DBG. In this section, we compare the performance of vertex reordering techniques, followed by an evaluation of GRASP on top of these techniques, demonstrating GRASP's generality.
In this section, we first summarize the performance of skew-aware techniques – Sort, HubSort [28] and DBG – for graph applications processing high-skew datasets. We also evaluate Gorder [41], a complex vertex reordering approach. Note that the software techniques are evaluated on a real machine with 40 hardware threads, as mentioned in Sec. 5.4.2 of Chapter 5.

Figure 6.8: Reordering Techniques + GRASP. (a) Net speed-up for existing software reordering techniques after accounting for their reordering cost on a real machine. (b) Application speed-up of GRASP over the RRIP baseline on top of different reordering techniques. The left group shows speed-up for a dataset across all applications while the right group shows speed-up for an application across all datasets.

Fig. 6.8(a) shows the speed-up for these software techniques after accounting for their reordering cost over the baseline with no reordering. Among skew-aware techniques, all are effective on the largest of the datasets (e.g., kr and sd) and on long iterative applications (e.g., PR). As these techniques rely on a low-cost approach to reordering, the reordering cost is amortized quickly when the application runtime is high, making these solutions practically attractive. Averaged across all application and dataset pairs, skew-aware techniques yield a net speed-up of 2.6% for Sort, 0.6% for HubSort and 10.8% for DBG.

Unsurprisingly, Gorder causes a significant slowdown on all datapoints due to its large reordering cost, yielding an average speed-up of -85.4%. Thus, Gorder is less practical when compared to simple yet effective skew-aware techniques.

As software vertex reordering techniques offer different trade-offs in preserving graph structure and reducing reordering cost, it is important for GRASP not to be coupled to any one software technique. In this section, we evaluate GRASP with different reordering techniques, both skew-aware and complex ones. While skew-aware techniques are readily compatible with GRASP, Gorder requires a simple tweak as follows.
Figure 6.9: Percentage of misses eliminated over LRU.
After applying Gorder on an original dataset, we apply DBG to further reorder vertices, which results in a vertex order that retains most of the Gorder ordering while also segregating hot vertices in a contiguous region, making Gorder compatible with GRASP.

Fig. 6.8(b) shows the speed-up for GRASP over RRIP on top of the same reordering technique as the baseline. As with DBG, GRASP consistently provides a speed-up across datasets and applications on top of the other reordering techniques as well. On average, GRASP yields a speed-up of 4.4%, 4.2%, 5.2% and 5.0% on top of Sort, HubSort, DBG and Gorder, respectively. The result confirms that GRASP complements a broad class of existing software reordering techniques.
In this section, we compare GRASP with Belady's optimal replacement policy (OPT) [114]. As OPT requires perfect knowledge of the future, we generate traces of LLC accesses (up to 2 billion accesses per trace) for the applications processing graph datasets reordered using DBG on the simulation baseline configuration specified in Sec. 6.3.3. We apply OPT on each trace for five different LLC sizes – 1MB, 4MB, 8MB, 16MB and 32MB – to obtain the minimum number of misses for a given cache size and report the percentage of misses eliminated over LRU for the same LLC size.
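For reference, the following sketch shows how Belady's OPT can be computed offline over such a trace for a fully-associative cache of a given capacity; the actual evaluation models set-associative caches in the simulator, so this is a simplification for exposition only.

#include <cstddef>
#include <cstdint>
#include <limits>
#include <unordered_map>
#include <vector>

// Offline Belady's OPT (MIN) on an LLC access trace for a fully-associative
// cache of `num_blocks` blocks; returns the minimum number of misses.
std::size_t opt_misses(const std::vector<uint64_t>& trace, std::size_t num_blocks) {
    constexpr std::size_t kNever = std::numeric_limits<std::size_t>::max();

    // Pass 1: for each access, find the position of the next access to the
    // same block (kNever if the block is not referenced again).
    std::vector<std::size_t> next_use(trace.size(), kNever);
    std::unordered_map<uint64_t, std::size_t> last_seen;
    for (std::size_t i = trace.size(); i-- > 0;) {
        auto it = last_seen.find(trace[i]);
        next_use[i] = (it == last_seen.end()) ? kNever : it->second;
        last_seen[trace[i]] = i;
    }

    // Pass 2: simulate; on a miss with a full cache, evict the resident block
    // whose next use lies farthest in the future (or never occurs).
    std::unordered_map<uint64_t, std::size_t> resident;  // block -> next use
    std::size_t misses = 0;
    for (std::size_t i = 0; i < trace.size(); ++i) {
        auto it = resident.find(trace[i]);
        if (it != resident.end()) {          // hit: refresh the next-use time
            it->second = next_use[i];
            continue;
        }
        ++misses;
        if (resident.size() == num_blocks) { // evict the farthest-reused block
            auto victim = resident.begin();
            for (auto r = resident.begin(); r != resident.end(); ++r)
                if (r->second > victim->second) victim = r;
            resident.erase(victim);
        }
        resident[trace[i]] = next_use[i];
    }
    return misses;
}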
Miss reduction on 16MB LLC:
Fig. 6.9 shows the results for OPT along with RRIP and GRASP for a 16MB LLC. OPT eliminates 34.3% of total misses over LRU. In comparison, GRASP eliminates 19.7% of misses (vs 15.2% for RRIP). Overall, GRASP is 57.5% as effective in eliminating misses as OPT, an offline technique with perfect knowledge of the future. While GRASP is the most effective among the online techniques, the results also show that the remaining opportunity (the difference between OPT and GRASP) is still significant, which warrants further research in this direction.
Sensitivity of GRASP to LLC size:
Table 6.5 shows the average percentage of misses eliminated by RRIP, GRASP and OPT for different LLC sizes over LRU.

Table 6.5: Percentage of misses eliminated over LRU for different LLC sizes.

Technique   1MB     4MB     8MB     16MB    32MB
RRIP        15.9%   16.4%   15.7%   15.2%   16.2%
GRASP       15.4%   17.0%   18.1%   19.7%   21.2%
OPT         27.5%   32.2%   33.3%   34.3%   34.5%

With the increase in LLC size, GRASP becomes more effective at eliminating misses over LRU (average miss reduction of 15.4% for 1MB vs 21.2% for 32MB). This is expected, as a larger LLC allows GRASP to provide preferential treatment to more hot vertices. In general, yet larger LLC sizes are expected to benefit even more from GRASP until the LLC becomes large enough to accommodate all hot vertices.
Shared-memory graph frameworks:
A significant amount of research has focused on designing high-performance shared-memory frameworks for graph applications. The majority of these frameworks are vertex-centric [42, 48, 55, 57, 62, 75] and use CSR or its variants to encode a graph, making GRASP readily compatible with these frameworks. More generally, GRASP requires classification of only the Property Array(s), making it independent of the specific data structure used to represent the graph, which further increases compatibility across the spectrum of frameworks. Thus, we expect GRASP to reduce misses across frameworks, though absolute speed-ups will likely vary.
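For illustration, a minimal CSR-style layout and its per-vertex Property Array are sketched below; only the Property Array would be registered with GRASP, while the offsets and edges arrays are streamed and need no classification. The field names are illustrative, not taken from any particular framework.

#include <cstdint>
#include <vector>

// Minimal CSR-style graph encoding and its per-vertex Property Array.
struct CSRGraph {
    std::vector<uint64_t> offsets;  // |V|+1 entries: start of each vertex's edge list
    std::vector<uint32_t> edges;    // |E| entries: destination vertex ids
};

struct PageRankState {
    std::vector<double> rank;       // Property Array: one entry per vertex id
};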
Distributed-memory graph frameworks:
Distributed graph processing frameworks can also benefit from GRASP. For example, PGX [44] and PowerGraph [61] proposed duplicating high-degree vertices across graph partitions to reduce the high communication overhead across computing nodes. These optimizations are largely orthogonal to GRASP cache management. As such, GRASP can be applied to distributed graph processing by caching high-degree vertices within each node's LLC to improve node-level cache behavior.
Streaming graph frameworks:
In this work, we have assumed that graphs are static. In practice, graphs may evolve over time, and a stream of graph updates (i.e., addition or removal of vertices or edges) is interleaved with graph-analytic queries (e.g., computing PageRank of vertices or computing shortest paths from different root vertices). For such deployment settings, a CSR-based structure is infeasible. Instead, researchers have proposed various data structures for graph encoding that can accommodate fast graph updates and allow space-efficient versioning [2, 46, 60]. Meanwhile, each graph query is performed on a consistent view (i.e., a static snapshot) of a graph. For example, Aspen [2], a recent graph-streaming framework, uses Ligra (a static graph-processing framework) in the back-end to run graph-analytic queries. Thus, the observations made in this chapter regarding cache thrashing due to the irregular access patterns of the Property Array, as well as skew-aware reordering and GRASP being complementary in combating cache thrashing, are also relevant for dynamic graphs.

For static graphs, vertex reordering cost is amortized over multiple graph traversals for a single graph query (as shown in Fig. 6.8(a)). However, for dynamic graphs, reordering cost can be further amortized over multiple graph queries. Intuitively, addition or deletion of some vertices or edges in a large graph would not lead to a drastic change in the degree distribution, and thus is unlikely to change which vertices are classified hot in a short time window. Therefore, skew-aware reordering can be applied at periodic intervals, after a series of updates has been made to a graph, to improve cache behavior, amortizing reordering cost over multiple graph queries.
Hardware prefetchers:
Modern processors typically employ prefetchers that target stride-based access patterns and are thus ineffective for graph analytics. Researchers have proposed custom prefetchers at the L1-D that specifically target the indirect memory access patterns of graph analytics [29, 49]. Nevertheless, prefetching can only hide memory access latency. Unlike cache replacement, prefetching cannot reduce memory bandwidth pressure or DRAM energy expenditure. Indeed, prior work observes that even an ideal, 100% accurate prefetcher for graph analytics is bottlenecked by memory bandwidth [49]. In contrast, GRASP reduces bandwidth pressure by reducing LLC misses, and thus is complementary to prefetching.
In this chapter, we explored how to design hardware cache management to tackle cache thrashing at LLC for the domain of graph analytics. We showed that state-of-the-art history-based predictive cache management techniques are deficient in the presence of cache thrashing stemming from irregular access patterns of graph applications processing large graphs. In response, we introduced GRASP, specialized LLC cache management for graph analytics on natural graphs. GRASP's specialized cache policies exploit the high reuse inherent in hot vertices while maintaining the flexibility to capture reuse in other cache blocks. GRASP leverages software reordering optimizations such as DBG to enable a lightweight interface that allows hardware to reliably pinpoint hot vertices amidst irregular access patterns. In doing so, GRASP avoids the need for a storage-intensive prediction mechanism or additional metadata storage in the LLC. GRASP requires minimal hardware support, making it attractive for integration into commodity server processors to enable acceleration for the domain of graph analytics. Finally, GRASP delivers consistent performance gains on high-skew datasets, while preventing slowdowns on low-skew datasets.

Chapter 7
Conclusions and Future Work
In this section, we summarize the main contributions made in the preceding chapters.
In Chapter 3, we highlighted the limitations of state-of-the-art history-based predictive techniques in achieving high performance in the face of variability. To address those limitations, we argued for variability-tolerant mechanisms and policies for cache management. As a step in that direction, we proposed Leeway, a history-based predictive technique employing two variability-tolerant features. First, Leeway introduces a new metric, Live Distance, that captures the largest interval of temporal reuse for a cache block, providing a conservative estimate of a cache block's useful lifetime. Second, Leeway implements a robust prediction mechanism that identifies dead blocks based on their past Live Distance values. To maximize cache efficiency in the face of variability, Leeway monitors the change in Live Distance values at runtime using its reuse-aware policies to adapt to the observed access patterns. Meanwhile, Leeway embeds prediction metadata with cache blocks in order to avoid history table look-ups on the critical path on cache hits and to reduce on-chip network traffic, in contrast to the state-of-the-art techniques that access the history table on every cache access (including cache hits). On a variety of applications and deployment scenarios, Leeway consistently provides good performance that generally matches or exceeds that of state-of-the-art techniques.
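The sketch below captures the essence of the Live Distance idea in much-simplified form: the deepest recency-stack position at which a block is hit is remembered, and a block that ages past the live distance predicted for its load PC is treated as dead. The predictor organization, training and decay policies of the full Leeway design are omitted, so this is an illustration rather than the implemented mechanism.

#include <algorithm>
#include <cstdint>

// Simplified per-block metadata for the Live Distance idea.
struct BlockMeta {
    uint8_t observed_live_distance = 0;  // deepest stack position seen on a hit
    uint8_t predicted_live_distance = 0; // looked up on fill from a PC-indexed table
};

// Called on a hit: record how deep in the recency stack the block was found.
void on_hit(BlockMeta& b, uint8_t stack_position) {
    b.observed_live_distance =
        std::max(b.observed_live_distance, stack_position);
}

// Consulted by the replacement policy: a block that has aged past its
// predicted live distance without a hit is treated as dead.
bool predicted_dead(const BlockMeta& b, uint8_t stack_position) {
    return stack_position > b.predicted_live_distance;
}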
In Chapter 5, we studied existing skew-aware reordering techniques that seek to improve cache efficiency for graph analytics by reducing the cache footprint of hot vertices. We demonstrated the inherent tension between reducing the cache footprint of hot vertices and preserving the original graph structure, which limits the effectiveness of existing skew-aware reordering techniques. In response, we proposed Degree-Based Grouping (DBG), a lightweight vertex reordering software technique that employs coarse-grain reordering to preserve graph structure while reducing the cache footprint of hot vertices. On a variety of graph applications and datasets, DBG achieves higher average performance than all existing skew-aware techniques and nearly matches the average performance of the state-of-the-art complex reordering technique.
In Chapter 6, we explored how to design hardware cache management to tackle cache thrashing at LLC for the domain of graph analytics. We showed that state-of-the-art history-based predictive cache management techniques are deficient in the presence of cache thrashing stemming from irregular access patterns of graph applications processing large graphs. In response, we introduced GRASP, specialized LLC cache management for graph analytics on natural graphs. GRASP's specialized cache policies exploit the high reuse inherent in hot vertices while maintaining the flexibility to capture reuse in other cache blocks. GRASP leverages software reordering optimizations such as DBG to enable a lightweight interface that allows hardware to reliably pinpoint hot vertices amidst irregular access patterns. In doing so, GRASP avoids the need for a storage-intensive prediction mechanism or additional metadata storage in the LLC. GRASP requires minimal hardware support, making it attractive for integration into commodity server processors to enable acceleration for the domain of graph analytics. Finally, GRASP delivers consistent performance gains on high-skew datasets, while preventing slowdowns on low-skew datasets.
In this section, we perform a critical analysis of the proposals presented in the prior chapters.
Hardware overhead of a cache management technique may hinder its commercial adoption. Leeway, like the state-of-the-art Hawkeye and most other history-based techniques, requires a PC signature to be propagated through the core pipeline all the way to the LLC. Leeway also requires slightly more storage than the prior techniques (e.g., 44KB for Leeway vs 31KB for Hawkeye) to store recency state and other prediction metadata. However, it is noteworthy that the total storage requirement for Leeway is only 1.4% of LLC capacity. More importantly, Leeway accesses the history table completely off the critical path, unlike Hawkeye, and requires significantly fewer look-ups than prior techniques.

GRASP altogether removes the requirement of a history table, and, in turn, of propagating a PC signature. Instead, reuse predictions rely on a new interface, which software uses to pass semantic information about the application to the hardware. While the interface is lightweight, it does require a new LLC component that is physically placed near the core. While such a distributed design of LLC components may not pose a technical challenge, it may incur extra organizational cost by requiring additional communication between the core, cache design and verification teams.

Overall, the hardware overheads of our proposals are generally at or below par with the state-of-the-art techniques. Meanwhile, they generally provide higher performance improvements than the state-of-the-art techniques across a variety of applications and deployment scenarios, making them promising candidates among the high-performance techniques for commercial adoption.
In this thesis, we use a simulation-based methodology to evaluate various cache management techniques. Our decision to restrict ourselves to simulation infrastructures, thereby trading off accuracy and cost for speed and ease of evaluation, is influenced by the prohibitive cost of evaluating architectural modifications in real chips. We follow the well-accepted practice in both academic and industrial architecture research of evaluating the performance impact of microarchitectural features through simulation. Having said that, we note that the proposals presented in this thesis are backed by intuitive reasoning and sound modeling of cache statistics (e.g., modeling of miss rate or MPKI) to ensure reproducibility of results on real chips.
In this thesis, we proposed domain-specialized cache management only for the domain of graph analytics. In practice, there are numerous other emerging domains such as data analytics, machine learning and other big data applications (e.g., popular data center applications such as web search and data serving) that could potentially benefit from domain-specialized cache management. We do not characterize those applications, as studying the fundamental cache access patterns of all (or a subset of) applications from a given domain requires significant time, resources and domain expertise. However, doing so may not be a barrier for a commercial entity that wishes to accelerate a particular domain of interest considered of high value for its business. Therefore, we envision that, in future systems, for selected high-value domains, the LLC will be managed via domain-specialized cache management (such as GRASP) while, for the rest of the applications, the LLC will be managed via a robust domain-agnostic technique such as Leeway. It is noteworthy that each domain-specialized cache management technique may not necessarily require a unique software-hardware interface, as the interface can be made abstract (as done for GRASP) and can be generalized to meet the requirements of a set of domains.
In this section, we discuss limitations of our proposals presented in the preceding chapters and highlight potential future directions for research in cache management.
As explained in Chapter 2, a cache hierarchy can be maintained as fully-inclusive, as fully-exclusive or as non-inclusive non-exclusive (NINE). In this thesis, we simulated Leeway and GRASP under a NINE LLC. Leeway and GRASP (as well as state-of-the-art history-based predictive techniques) employ aggressive prediction mechanisms to reduce cache thrashing. For example, Leeway bypasses the insertion of cache blocks that are predicted dead on arrival by forwarding data directly to the higher-level caches, and GRASP inserts cache blocks that are expected to have no reuse with the least priority, immediately making them eviction candidates. While such mechanisms are useful in reducing cache pollution and, in turn, improving application performance for a NINE LLC, they cannot be readily ported to fully-inclusive and fully-exclusive LLCs, as discussed below.
Fully-inclusive LLC:
For a fully-inclusive LLC, a cache block eviction at the LLC requires a back invalidation to evict the same cache block from all the higher-level caches to maintain inclusion. Under such an inclusion policy, bypassing is, by definition, not possible as the LLC must contain the cache blocks present in any higher-level cache. Similarly, other aggressive mechanisms may not always be beneficial for a fully-inclusive LLC, as cache blocks that do not exhibit any reuse at the LLC may exhibit high reuse at the higher-level caches. Evicting such cache blocks from a fully-inclusive LLC triggers back invalidations, leading to premature evictions of these cache blocks from the higher-level caches. Therefore, accommodating such aggressive thrash-resistant mechanisms for a fully-inclusive LLC may require coordination across different levels of the cache hierarchy, such as
Query Based Selection (QBS) [70]. While QBS has been shown to work for recency-friendly techniques like LRU or NRU, integrating QBS with aggressive thrash-resistant techniques such as Leeway (or prior history-based techniques) remains an open question, as discussed below.

QBS selects a provisional victim (e.g., the LRU cache block) and queries the higher-level caches (e.g., L1, L2 or both) to check if they contain the provisionally selected victim cache block. If they do, QBS infers that the provisional victim has long temporal reuse in the higher-level caches, and thus gives it a second chance by increasing the priority of the provisional victim (e.g., by moving the victim to the MRU position). Subsequently, QBS attempts to find another victim, such as the second least recently used cache block, and so on. Meanwhile, if the provisional victim is not present in the higher-level caches, QBS evicts the block from the LLC. Intuitively, the time window for a block to move from the MRU position to the LRU position at the LLC under recency-friendly techniques is reasonably long, which allows the higher-level caches to completely exploit the reuse of cache blocks having short temporal reuse. Thus, the QBS policy is effective for recency-friendly techniques as it can differentiate cache blocks with long temporal reuse from blocks with short temporal reuse in the higher-level caches. However, combining QBS with aggressive thrash-resistant techniques at the LLC poses a challenge. Consider the example of SHiP, which inserts a significant fraction of cache blocks at the LRU position, leaving little time for the higher-level caches to fully exploit the reuse of many cache blocks. Therefore, a significant fraction of victim cache blocks are likely to be present in the higher-level caches, forcing QBS to give them a second chance. However, doing so defeats the purpose of their insertion at the LRU position as these blocks are unlikely to exhibit any reuse.
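A minimal sketch of QBS victim selection, following the description above, is given below; the helpers present_in_higher_levels and promote_to_mru stand in for the coherence and replacement machinery and are assumptions for exposition rather than details from [70].

#include <cstdint>
#include <vector>

// Stand-ins for coherence/replacement machinery (stubs so the sketch compiles).
bool present_in_higher_levels(uint64_t /*block*/) { return false; }  // query to L1/L2
void promote_to_mru(uint64_t /*block*/) {}                           // second-chance promotion

// Candidates are ordered from least to most recently used (assumed non-empty).
// Returns the block to evict, or falls back to the LRU block if every
// candidate is cached above (the fallback is a policy choice, not prescribed
// by QBS).
uint64_t qbs_select_victim(const std::vector<uint64_t>& candidates_lru_first) {
    for (uint64_t candidate : candidates_lru_first) {
        if (!present_in_higher_levels(candidate))
            return candidate;            // not cached above: safe to evict
        promote_to_mru(candidate);       // second chance; try the next candidate
    }
    return candidates_lru_first.front(); // all candidates cached above
}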
Fully-exclusive LLC:
For a fully-exclusive LLC, on an LLC hit, a cache block is moved from the LLC to L2, which involves an eviction at the LLC and an insertion at L2. Thus, by design, in a single generation of a cache block, the block can incur at most one hit. Under such an inclusion policy, a cache block is evicted from the LLC on a cache hit, and thus loses reuse information (e.g., Live Distance for Leeway) that could otherwise be accumulated over the block's on-chip residency. One potential way to mitigate this is by utilizing the cache directory. The directory keeps track of the coherence state of each cache block and is usually inclusive of all on-chip cache blocks even when the LLC is not. Thus, the directory can be augmented to accumulate reuse information per cache block during the block's on-chip residency.
Like Leeway, most of the prior history-based predictive techniques rely on a PC-based reuse correlation for reuse prediction [8, 13, 14, 17, 18, 19, 24, 25, 27, 37, 39, 40, 67, 73, 81]. Thus, they require propagating a PC signature through the core pipeline all the way to the LLC. While a PC signature requires far fewer bits than a full PC address (e.g., 14 bits for a PC signature vs 48 bits for a full PC address), the number of bits that need to be added to a cache request to accommodate a PC signature is still non-trivial, which so far has hindered the commercial adoption of PC-based predictive techniques for LLC management. This calls for new mechanisms to predict reuse of cache blocks that do not rely on PC signatures, yet provide performance that is on par with, if not above, the PC-based predictive techniques.

GRASP employs one such mechanism, leveraging lightweight software support. GRASP not only eliminates the need for propagating a PC signature but also eliminates the need for storage-intensive history tables altogether. GRASP requires propagating only a 2-bit Reuse Hint to the LLC on each cache access to guide cache policy decisions.
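As an illustration of the signature compression mentioned above, the sketch below XOR-folds a 48-bit PC into a 14-bit signature; prior techniques differ in the exact hash they use, so this is only an example of the idea, not the hash of any particular proposal.

#include <cstdint>

// Illustrative XOR-fold of a 48-bit PC into a 14-bit signature.
constexpr unsigned kSignatureBits = 14;

uint16_t pc_signature(uint64_t pc) {
    uint64_t x = pc & ((1ULL << 48) - 1);                 // 48 architectural bits
    uint16_t sig = 0;
    while (x != 0) {                                      // fold 14-bit chunks together
        sig ^= static_cast<uint16_t>(x & ((1u << kSignatureBits) - 1));
        x >>= kSignatureBits;
    }
    return sig;
}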
Software vertex reordering techniques are effective when the time required for the reordering is less than the reduction in the execution time of an application due to improved cache efficiency. For applications with short execution times, the cost of a vertex reordering technique may not be amortized, resulting in a net slowdown (e.g., SSSP from one root traversal in Fig. 5.10 of Chapter 5). However, we believe there are two future research directions that have the potential to amortize reordering cost even for such applications.
Integrating reordering techniques with graph generation:
In this thesis, we assumed that the graph datasets are readily available, and thus also assumed that the spatio-temporal locality in real-world datasets (specifically for the structured datasets) exists without any overhead. In practice, such an ordering may be a positive side effect of the dataset generation algorithm (e.g., crawling webpages in a certain order) or it may have been achieved by post-processing a dataset (e.g., graph datasets available from The Laboratory for Web Algorithmics have been ordered with the Layered Label Propagation technique [65]). Thus, there exists an opportunity to integrate skew-aware reordering techniques with the dataset generation process; by doing so, we can eliminate the need to regenerate the CSR-like structure after vertex reordering, which dominates the reordering cost. At the very least, the cost of a reordering technique should be compared to the cost of a post-processing technique used over the raw dataset to understand the cost-benefit trade-offs of techniques from different domains.
Amortizing reordering costs on dynamic graphs:
In this thesis, we assumed that graphs are static, and thus evaluated net speed-up conservatively, assuming only one graph application (or query) over the reordered dataset (refer to Fig. 5.9 in Chapter 5). In practice, a graph may evolve over time and a stream of graph updates (i.e., addition or removal of vertices or edges) is interleaved with graph-analytic queries. For such a deployment, graph reordering may provide an even greater benefit as the reordering cost can be amortized not only over multiple graph traversals of a single query, but also over multiple graph queries. Intuitively, addition or removal of some vertices or edges in a large graph would not lead to a drastic change in the degree distribution, and thus is unlikely to change which vertices are classified hot in a short time window. Therefore, reordering techniques may only need to be re-applied at large periodic intervals (i.e., after a series of updates has been made to a graph) to improve cache behavior, amortizing the cost of reordering over multiple graph queries performed in a given interval.
In this thesis, we emphasized the need for robust cache management mechanisms and policies for LLC to minimize cache misses in the face of variability in the reuse behavior of cache blocks. To that end, we proposed two cache management techniques, employing new variability-tolerant features such as a new metric (Live Distance) and adaptive reuse-aware policies in Leeway, and software-guided cache management for graph analytics in GRASP. While these features are used by our proposed techniques in a specific way, we believe they can potentially be integrated with other cache management techniques to make them robust in addressing variability in reuse prediction for LLC.

Bibliography

[1] P. Faldu, J. Diamond, and B. Grot. “Domain-Specialized Cache Management for Graph Analytics”. In:
IEEE International Symposium on High-Performance ComputerArchitecture . HPCA’20. IEEE, Feb. 2020. doi: 10.1109/HPCA47549.2020.00028 (cit. onp. 7).[2] L. Dhulipala, G. E. Blelloch, and J. Shun. “Low-latency Graph Streaming UsingCompressed Purely-functional Trees”. In:
International Conference on ProgrammingLanguage Design and Implementation . PLDI 2019. Association for ComputingMachinery, June 2019. doi: 10.1145/3314221.3314598 (cit. on p. 113).[3] P. Faldu, J. Diamond, and B. Grot. “A Closer Look at Lightweight Graph Reordering”.In:
IEEE International Symposium on Workload Characterization . IISWC’19. IEEE, Nov.2019. doi: 10.1109/IISWC47752.2019.9041948 (cit. on p. 7).[4] P. Faldu, J. Diamond, and B. Grot. “POSTER: Domain-Specialized Cache Managementfor Graph Analytics”. In:
International Conference on Parallel Architectures andCompilation Techniques . PACT’19. IEEE, Sept. 2019. doi: 10.1109/PACT.2019.00051(cit. on p. 7).[5] P. Faldu, J. Diamond, and A. Patel. “Cache Memory Architecture and Policies forAccelerating Graph Algorithms”. U.S. pat. 10417134. Oracle International Corporation.Sept. 2019 (cit. on p. 7).[6] V. Balaji and B. Lucia. “When is Graph Reordering an Optimization? Studying theEffect of Lightweight Graph Reordering Across Applications and Input Graphs”. In:
IEEE International Symposium on Workload Characterization . IISWC’18. Sept. 2018. doi:10.1109/IISWC.2018.8573478 (cit. on pp. 5, 60, 65, 75, 79, 82, 89).[7]
Hyperlink Graphs . http://webdatacommons.org/hyperlinkgraph. Web Data Commons,2018 (cit. on pp. 81, 100).[8] A. Jain and C. Lin. “Rethinking Belady’s Algorithm to Accommodate Prefetching”. In:
International Symposium on Computer Architecture . ISCA’18. IEEE Press, June 2018.doi: 10.1109/ISCA.2018.00020 (cit. on pp. 3, 15, 20, 23, 67, 93, 99, 107, 120).[9] A. Mukkara, N. Beckmann, M. Abeydeera, X. Ma, and D. Sanchez. “Exploiting Locality inGraph Analytics through Hardware-Accelerated Traversal Scheduling”. In:
Proceedingsof the ACM/IEEE International Symposium on Microarchitecture . MICRO-51. IEEE Press,Oct. 2018. doi: 10.1109/MICRO.2018.00010 (cit. on pp. 85, 91).[10] N. Vijaykumar, A. Jain, D. Majumdar, K. Hsieh, G. Pekhimenko, E. Ebrahimi, N.Hajinazar, P. B. Gibbons, and O. Mutlu. “A Case for Richer Cross-Layer Abstractions:Bridging the Semantic Gap with Expressive Memory”. In:
International Symposium onComputer Architecture . ISCA’18. IEEE Press, June 2018. doi: 10.1109/ISCA.2018.00027(cit. on pp. 15, 23, 24, 67, 93, 97, 103).
[11]
AMD Zen Microarchitectures. https://en.wikichip.org/wiki/amd/microarchitectures/zen. 2017 (cit. on p. 2).[12]
ChampSim: A Trace-based Cycle-accurate Simulator . https://github.com/ChampSim/ChampSim. June 2017 (cit. on p. 55).[13] J. Díaz, P. Ibáñez, T. Monreal, V. Viñals, and J. Llabería. “ReD: A Policy Based on ReuseDetection for a Demanding Block Selection in Last-Level Caches”. In:
InternationalWorkshop on Cache Replacement Championship, co-located with ISCA . CRC2. http ://crc2.ece.tamu.edu. June 2017 (cit. on pp. 23, 55, 120).[14] P. Faldu and B. Grot. “Reuse-Aware Management for Last-Level Caches”. In:
International Workshop on Cache Replacement Championship, co-located with ISCA .CRC2. http://crc2.ece.tamu.edu. June 2017 (cit. on pp. 7, 23, 120).[15] P. Faldu and B. Grot. “Leeway: Addressing Variability in Dead-Block Prediction for Last-Level Caches”. In:
International Conference on Parallel Architectures and CompilationTechniques . PACT’17. IEEE, Sept. 2017. doi: 10.1109/PACT.2017.32 (cit. on p. 7).[16] J. L. Hennessy and D. A. Patterson.
Computer Architecture, Sixth Edition: A QuantitativeApproach . 6th. https://dl.acm.org/doi/10.5555/3207796. Morgan Kaufmann PublishersInc., 2017. isbn: 0128119055 (cit. on p. 9).[17] A. Jain and C. Lin. “Hawkeye Cache Replacement: Leveraging Belady’s Algorithmfor Improved Cache Replacement”. In:
International Workshop on Cache ReplacementChampionship, co-located with ISCA . CRC2. http://crc2.ece.tamu.edu. June 2017 (cit. onpp. 23, 55, 120).[18] D. A. Jiménez and E. Teran. “Multiperspective Reuse Prediction”. In:
Proceedings of theIEEE/ACM International Symposium on Microarchitecture . MICRO-50. Association forComputing Machinery, Oct. 2017. doi: 10.1145/3123939.3123942 (cit. on pp. 3, 15, 20,23, 67, 93, 120).[19] D. A. Jiménez. “Multiperspective Reuse Prediction”. In:
International Workshop on CacheReplacement Championship, co-located with ISCA . CRC2. http://crc2.ece.tamu.edu. June2017 (cit. on pp. 23, 55, 120).[20] J. Kim and P. V. Gratz.
The 2nd Cache Replacement Championship, co-located with ISCA .CRC2. http://crc2.ece.tamu.edu. June 2017 (cit. on pp. 23, 54, 103).[21] J. Kim, E. Teran, P. V. Gratz, D. A. Jiménez, S. H. Pugsley, and C. Wilkerson. “Kill theProgram Counter: Reconstructing Program Behavior in the Processor Cache Hierarchy”.In:
Proceedings of the International Conference on Architectural Support for ProgrammingLanguages and Operating Systems . ASPLOS’17. Association for Computing Machinery,Apr. 2017. doi: 10.1145/3037697.3037701 (cit. on p. 23).[22] K. Lakhotia, S. Singapura, R. Kannan, and V. Prasanna. “ReCALL: Reordered CacheAware Locality Based Graph Processing”. In:
IEEE International Conference on HighPerformance Computing . HiPC’17. IEEE, Dec. 2017. doi: 10.1109/HiPC.2017.00039(cit. on p. 65).[23]
Twitter (MPI) network dataset – KONECT . http://konect.uni- koblenz.de/networks/twitter_mpi. The Koblenz Network Collection, 2017 (cit. on p. 81).[24] A. Vakil-Ghahani, S. Mahdizadeh-Shahri, M. Lotfi-Namin, M. Bakhshalipour, P. Lotfi-Kamran, and H. Sarbazi-Azad. “Cache Replacement Policy Based on Expected HitCount”. In:
International Workshop on Cache Replacement Championship, co-locatedwith ISCA . CRC2. http://crc2.ece.tamu.edu. June 2017 (cit. on pp. 23, 120).
[25] J. Wang, L. Zhang, R. Panda, and L. John. “Less is More: Leveraging Belady’s Algorithm with Demand-based Learning”. In:
International Workshop on Cache ReplacementChampionship, co-located with ISCA . CRC2. http : / / crc2 . ece . tamu . edu. June 2017(cit. on pp. 23, 55, 120).[26]
Wikipedia, English network dataset – KONECT . http://konect.uni-koblenz.de/networks/dbpedia-link. The Koblenz Network Collection, 2017 (cit. on p. 81).[27] V. Young, C. Chou, A. Jaleel, and M. K. Qureshi. “SHiP++: Enhancing Signature-BasedHit Predictor for Improved Cache Performance”. In:
International Workshop on CacheReplacement Championship, co-located with ISCA . CRC2. http://crc2.ece.tamu.edu. June2017 (cit. on pp. 23, 55, 120).[28] Y. Zhang, V. Kiriansky, C. Mendis, S. Amarasinghe, and M. Zaharia. “Making cacheswork for graph analytics”. In:
IEEE International Conference on Big Data . Big Data’17.IEEE, Dec. 2017. doi: 10.1109/BigData.2017.8257937 (cit. on pp. 5, 60, 65, 71, 75, 79, 82,91, 109).[29] S. Ainsworth and T. M. Jones. “Graph Prefetching Using Data Structure Knowledge”.In:
International Conference on Supercomputing . ICS’16. Association for ComputingMachinery, June 2016. doi: 10.1145/2925426.2926254 (cit. on pp. 97, 113).[30] J. Arai, H. Shiokawa, T. Yamamuro, M. Onizuka, and S. Iwamura. “Rabbit Order: Just-in-Time Parallel Reordering for Fast Graph Analysis”. In:
IEEE International Paralleland Distributed Processing Symposium . IPDPS’16. IEEE, May 2016. doi: 10.1109/IPDPS.2016.110 (cit. on p. 65).[31] N. Beckmann and D. Sanchez. “Modeling Cache Performance Beyond LRU”. In:
IEEEInternational Symposium on High-Performance Computer Architecture . HPCA’16. Mar.2016. doi: 10.1109/HPCA.2016.7446067 (cit. on pp. 56, 57).[32] P. Faldu and B. Grot. “LLC Dead Block Prediction Considered Not Useful”. In:
International Workshop on Duplicating, Deconstructing and Debunking, co-located withISCA . WDDD-13. June 2016 (cit. on pp. 7, 36, 107).[33]
Friendster network dataset – KONECT . http : / / konect . uni - koblenz . de / networks /friendster. The Koblenz Network Collection, 2016 (cit. on pp. 81, 100).[34] T. J. Ham, L. Wu, N. Sundaram, N. Satish, and M. Martonosi. “Graphicionado: A high-performance and energy-efficient accelerator for graph analytics”. In:
Proceedings ofthe ACM/IEEE International Symposium on Microarchitecture . MICRO-49. IEEE Press,Oct. 2016. doi: 10.1109/MICRO.2016.7783759 (cit. on p. 91).[35]
Intel Broadwell Microarchitectures .https://en.wikichip.org/wiki/intel/microarchitectures/broadwell_(client). 2016 (cit. onp. 2).[36]
Intel Xeon Processor E5-2630 v4 . https://ark.intel.com/products/92981/Intel- Xeon-Processor-E5-2630-v4-25M-Cache-2_20-GHz. Intel Corporation, 2016 (cit. on p. 82).[37] A. Jain and C. Lin. “Back to the Future: Leveraging Belady’s Algorithm for ImprovedCache Replacement”. In:
International Symposium on Computer Architecture. ISCA’16. IEEE Press, June 2016. doi: 10.1109/ISCA.2016.17 (cit. on pp. 3, 4, 15, 20–23, 26–28, 41, 44, 45, 55, 66, 93, 99, 102, 104, 105, 107, 120). [38] A. Mukkara, N. Beckmann, and D. Sanchez. “Whirlpool: Improving Dynamic Cache Management with Static Data Classification”. In:
International Conference onArchitectural Support for Programming Languages and Operating Systems . ASPLOS ’16.Association for Computing Machinery, Mar. 2016. doi: 10 . 1145 / 2872362 . 2872363(cit. on p. 97).[39] E. Teran, Y. Tian, Z. Wang, and D. A. Jiménez. “Minimal disturbance placementand promotion”. In:
IEEE International Symposium on High-Performance ComputerArchitecture . HPCA’16. IEEE, Mar. 2016. doi: 10.1109/HPCA.2016.7446065 (cit. on pp. 3,4, 15, 16, 20, 23, 27, 28, 67, 93, 99, 107, 120).[40] E. Teran, Z. Wang, and D. A. Jiménez. “Perceptron Learning for Reuse Prediction”. In:
Proceedings of the IEEE/ACM International Symposium on Microarchitecture . MICRO-49.IEEE Press, Oct. 2016. doi: 10.1109/MICRO.2016.7783705 (cit. on pp. 3, 4, 15, 20, 23, 27,28, 57, 67, 93, 99, 107, 120).[41] H. Wei, J. X. Yu, C. Lu, and X. Lin. “Speedup Graph Processing by Graph Ordering”.In:
International Conference on Management of Data . SIGMOD’16. Association forComputing Machinery, June 2016. doi: 10.1145/2882903.2915220 (cit. on pp. 65, 79, 82,109).[42] S. Beamer, K. Asanovic, and D. A. Patterson. “The GAP Benchmark Suite”. In:
CoRR (2015). http://arxiv.org/abs/1508.03619 (cit. on pp. 61, 81, 91, 100, 112).[43] S. Das, T. M. Aamodt, and W. J. Dally. “Reuse Distance-Based Probabilistic CacheReplacement”. In:
ACM Transactions on Architecture and Code Optimization
International Conference for HighPerformance Computing, Networking, Storage and Analysis . SC’15. Association forComputing Machinery, Nov. 2015. doi: 10.1145/2807591.2807620 (cit. on p. 112).[45] F. Khorasani, R. Gupta, and L. N. Bhuyan. “Scalable SIMD-Efficient Graph Processing onGPUs”. In:
International Conference on Parallel Architectures and Compilation Techniques .PACT’15. IEEE, Oct. 2015. doi: 10.1109/PACT.2015.15 (cit. on p. 81).[46] P. Macko, V. J. Marathe, D. W. Margo, and M. I. Seltzer. “LLAMA: Efficient graphanalytics using Large Multiversioned Arrays”. In:
IEEE International Conference onData Engineering . ICDE’15. Apr. 2015. doi: 10.1109/ICDE.2015.7113298 (cit. on p. 113).[47] R. A. Rossi and N. K. Ahmed. “The Network Data Repository with Interactive GraphAnalytics and Visualization”. In:
Proceedings of the AAAI Conference on ArtificialIntelligence . AAAI’15. http : / / networkrepository. com / road - road - usa . php. AAAIPress, Jan. 2015 (cit. on p. 81).[48] N. Sundaram, N. Satish, M. M. A. Patwary, S. R. Dulloor, M. J. Anderson,S. G. Vadlamudi, D. Das, and P. Dubey. “GraphMat: High Performance GraphAnalytics Made Productive”. In:
Proceedings of the VLDB Endowment
Proceedings of the ACM/IEEE International Symposium on Microarchitecture . MICRO-48.Association for Computing Machinery, Dec. 2015. doi: 10.1145/2830772.2830807 (cit. onp. 113).
[50] T. E. Carlson, W. Heirman, S. Eyerman, I. Hur, and L. Eeckhout. “An Evaluation of High-Level Mechanistic Core Models”. In:
ACM Transactions on Architecture and CodeOptimization
SNAP Datasets: Stanford Large Network Dataset Collection .http://snap.stanford.edu/data. 2014 (cit. on pp. 81, 100).[52] X. Tong and A. Moshovos. “BarTLB: Barren page resistant TLB for managed runtimelanguages”. In:
International Conference on Computer Design . ICCD’14. IEEE, Oct. 2014.doi: 10.1109/ICCD.2014.6974692 (cit. on p. 97).[53] J. Brock, X. Gu, B. Bao, and C. Ding. “Pacman: Program-assisted Cache Management”.In:
International Symposium on Memory Management . ISMM’13. Association forComputing Machinery, June 2013. doi: 10.1145/2491894.2466482 (cit. on pp. 15, 23, 24).[54] D. A. Jiménez. “Insertion and promotion for tree-based PseudoLRU last-level caches”.In:
Proceedings of the ACM/IEEE International Symposium on Microarchitecture . MICRO-46. Association for Computing Machinery, Dec. 2013. doi: 10.1145/2540708.2540733(cit. on pp. 3, 15, 16, 18, 93).[55] D. Nguyen, A. Lenharth, and K. Pingali. “A Lightweight Infrastructure for GraphAnalytics”. In:
Proceedings of the ACM Symposium on Operating Systems Principles .SOSP’13. Association for Computing Machinery, Nov. 2013. doi: 10.1145/2517349.2522739 (cit. on pp. 61, 91, 112).[56] R. Sen and D. A. Wood. “Reuse-based Online Models for Caches”. In:
Proceedingsof the ACM SIGMETRICS International Conference on Measurement and Modeling ofComputer Systems . SIGMETRICS’13. Association for Computing Machinery, June 2013.doi: 10.1145/2465529.2465756 (cit. on p. 56).[57] J. Shun and G. E. Blelloch. “Ligra: A Lightweight Graph Processing Framework forShared Memory”. In:
Proceedings of the ACM SIGPLAN Symposium on Principles andPractice of Parallel Programming . PPoPP ’13. Association for Computing Machinery,Feb. 2013. doi: 10.1145/2442516.2442530 (cit. on pp. 61, 80, 81, 91, 100, 112).[58]
CloudSuite: The Benchmark Suite of Cloud Services . http://cloudsuite.ch. 2012 (cit. onp. 55).[59] N. Duong, D. Zhao, T. Kim, R. Cammarota, M. Valero, and A. V. Veidenbaum. “ImprovingCache Management Policies Using Dynamic Reuse Distances”. In:
Proceedings of theACM/IEEE International Symposium on Microarchitecture . MICRO-45. IEEE ComputerSociety, Dec. 2012. doi: 10.1109/MICRO.2012.43 (cit. on pp. 3, 15, 27, 31, 34, 56, 57, 93).[60] D. Ediger, R. McColl, J. Riedy, and D. A. Bader. “Stinger: High performance datastructure for streaming graphs”. In:
IEEE International Conference on High PerformanceExtreme Computing . HPEC’12. Sept. 2012. doi: 10.1109/HPEC.2012.6408680 (cit. onp. 113).[61] J. E. Gonzalez, Y. Low, H. Gu, D. Bickson, and C. Guestrin. “PowerGraph: DistributedGraph-parallel Computation on Natural Graphs”. In:
USENIX Symposium on OperatingSystems Design and Implementation
USENIX Symposium on Operating Systems Design and Implementation. [63] V. Seshadri, O. Mutlu, M. A. Kozuch, and T. C. Mowry. “The Evicted-address Filter: A Unified Mechanism to Address Both Cache Pollution and Thrashing”. In:
InternationalConference on Parallel Architectures and Compilation Techniques . PACT’12. Associationfor Computing Machinery, Sept. 2012. doi: 10.1145/2370816.2370868 (cit. on pp. 3, 15,93).[64] I. Stanton and G. Kliot. “Streaming Graph Partitioning for Large Distributed Graphs”.In:
ACM SIGKDD International Conference on Knowledge Discovery and Data Mining .KDD’12. Aug. 2012. doi: 10.1145/2339530.2339722 (cit. on p. 65).[65] P. Boldi, M. Rosa, M. Santini, and S. Vigna. “Layered Label Propagation: AMultiresolution Coordinate-free Ordering for Compressing Social Networks”. In:
Proceedings of the International Conference on World Wide Web . WWW’11. Associationfor Computing Machinery, Mar. 2011. doi: 10.1145/1963405.1963488 (cit. on p. 121).[66] U. Kang and C. Faloutsos. “Beyond ‘Caveman Communities’: Hubs and Spokes forGraph Compression and Mining”. In:
IEEE International Conference on Data Mining .ICDM’11. IEEE, Dec. 2011. doi: 10.1109/ICDM.2011.26 (cit. on p. 65).[67] C.-J. Wu, A. Jaleel, W. Hasenplaugh, M. Martonosi, S. C. Steely Jr., and J. Emer. “SHiP:Signature-based Hit Predictor for High Performance Caching”. In:
Proceedings of theACM/IEEE International Symposium on Microarchitecture . MICRO-44. Association forComputing Machinery, Dec. 2011. doi: 10.1145/2155620.2155671 (cit. on pp. 3, 15,20–23, 26, 27, 35, 38, 41, 43, 45, 55, 67, 93, 99, 102, 104–107, 120).[68] A. R. Alameldeen, A. Jaleel, M. K. Qureshi, and J. Emer.
JILP Workshop on ComputerArchitecture Competitions: Cache Replacement Championship
In JILP Workshop on Computer Architecture Competitions: CacheReplacement Championship
Proceedings of the IEEE/ACM International Symposium onMicroarchitecture . MICRO-43. IEEE Computer Society, Dec. 2010. doi: 10.1109/MICRO.2010.52 (cit. on p. 119).[71] A. Jaleel, K. B. Theobald, S. C. Steely Jr., and J. Emer. “High Performance CacheReplacement Using Re-reference Interval Prediction (RRIP)”. In:
InternationalSymposium on Computer Architecture . ISCA’10. Association for Computing Machinery,June 2010. doi: 10.1145/1815961.1815971 (cit. on pp. 3, 12, 15, 16, 18, 19, 38, 43, 66, 93,99, 102, 107).[72] S. M. Khan, D. A. Jiménez, D. Burger, and B. Falsafi. “Using Dead Blocks as a VirtualVictim Cache”. In:
Proceedings of the International Conference on Parallel Architecturesand Compilation Techniques . PACT ’10. Association for Computing Machinery, Sept.2010. doi: 10.1145/1854273.1854333 (cit. on p. 57).[73] S. M. Khan, Y. Tian, and D. A. Jiménez. “Sampling Dead Block Prediction for Last-LevelCaches”. In:
Proceedings of the ACM/IEEE International Symposium on Microarchitecture .MICRO-43. IEEE Computer Society, Dec. 2010. doi: 10.1109/MICRO.2010.24 (cit. onpp. 3, 4, 15, 20–23, 26–28, 30, 31, 41, 43, 45, 67, 93, 99, 106, 107, 120).
[74] H. Kwak, C. Lee, H. Park, and S. Moon. “What is Twitter, a social network or a news media?” In:
International Conference on World Wide Web . WWW’10. Association forComputing Machinery, Apr. 2010. doi: 10.1145/1772690.1772751 (cit. on pp. 81, 100).[75] Y. Low, J. Gonzalez, A. Kyrola, D. Bickson, C. Guestrin, and J. M. Hellerstein. “GraphLab:A New Framework For Parallel Machine Learning”. In:
The Conference on Uncertaintyin Artificial Intelligence . UAI’10. https://dl.acm.org/doi/10.5555/3023549.3023589. AUAIPress, July 2010 (cit. on pp. 61, 91, 112).[76] M. Chaudhuri. “Pseudo-LIFO: The Foundation of a New Family of Replacement Policiesfor Last-level Caches”. In:
Proceedings of the ACM/IEEE International Symposium onMicroarchitecture . MICRO-42. Dec. 2009. doi: 10.1145/1669112.1669164 (cit. on pp. 3,15, 57, 93).[77] C. Magnien, M. Latapy, and M. Habib. “Fast Computation of Empirically Tight Boundsfor the Diameter of Massive Graphs”. In:
Journal of Experimental Algorithmics
13 (Feb.2009). doi: 10.1145/1412228.1455266 (cit. on p. 80).[78] Y. Xie and G. H. Loh. “PIPP: Promotion/Insertion Pseudo-partitioning of Multi-coreShared Caches”. In:
International Symposium on Computer Architecture . ISCA’09.Association for Computing Machinery, June 2009. doi: 10 . 1145 / 1555754 . 1555778(cit. on pp. 3, 15, 93).[79] A. Jaleel, R. S. Cohn, C.-K. Luk, and B. Jacob. “CMP$im: A Pin-Based On-The-FlyMulti-Core Cache Simulator”. In:
International Workshop on Modeling, Benchmarkingand Simulation (MoBS) . 2008 (cit. on pp. 42, 55).[80] A. Jaleel, W. Hasenplaugh, M. K. Qureshi, J. Sebot, S. Steely Jr., and J. Emer. “AdaptiveInsertion Policies for Managing Shared Caches”. In:
International Conference onParallel Architectures and Compilation Techniques . PACT’08. Association forComputing Machinery, Oct. 2008. doi: 10.1145/1454115.1454145 (cit. on pp. 3, 15, 18,93).[81] M. Kharbutli and Y. Solihin. “Counter-Based Cache Replacement and BypassingAlgorithms”. In:
In: IEEE Transactions on Computers
[82] In: The Workshop on Chip Multiprocessor Memory Systems and Interconnects. 2008 (cit. on pp. 3, 15, 18, 93).
[83] J. Leskovec, K. J. Lang, A. Dasgupta, and M. W. Mahoney. “Statistical Properties of Community Structure in Large Social and Information Networks”.
In: International Conference on World Wide Web. WWW’08. Association for Computing Machinery, Apr. 2008. doi: 10.1145/1367497.1367591 (cit. on pp. 60, 71).
[84] H. Liu, M. Ferdman, J. Huh, and D. Burger. “Cache Bursts: A New Approach for Eliminating Dead Blocks and Increasing Cache Efficiency”. In: IEEE/ACM International Symposium on Microarchitecture. MICRO-41. IEEE, Nov. 2008. doi: 10.1109/MICRO.2008.4771793 (cit. on p. 34).
[85] G. Keramidas, P. Petoumenos, and S. Kaxiras. “Cache replacement based on reuse-distance prediction”. In: International Conference on Computer Design. ICCD’07. IEEE, Oct. 2007. doi: 10.1109/ICCD.2007.4601909 (cit. on pp. 3, 15, 20, 31, 56, 93).
[86] M. K. Qureshi, A. Jaleel, Y. N. Patt, S. C. Steely, and J. Emer. “Adaptive Insertion Policies for High Performance Caching”.
In: International Symposium on Computer Architecture. ISCA’07. Association for Computing Machinery, June 2007. doi: 10.1145/1250662.1250709 (cit. on pp. 3, 12, 15–19, 37, 40, 93, 99).
[87] K. Rajan and G. Ramaswamy. “Emulating Optimal Replacement with a Shepherd Cache”. In: Proceedings of the ACM/IEEE International Symposium on Microarchitecture. MICRO-40. IEEE, Dec. 2007. doi: 10.1109/MICRO.2007.25 (cit. on pp. 3, 15, 93).
[88] M. K. Qureshi, D. N. Lynch, O. Mutlu, and Y. N. Patt. “A case for MLP-aware cache replacement”. In: International Symposium on Computer Architecture. ISCA’06. IEEE Computer Society, May 2006. doi: 10.1109/ISCA.2006.5 (cit. on pp. 3, 15, 93).
[89] R. Subramanian, Y. Smaragdakis, and G. H. Loh. “Adaptive Caches: Effective Shaping of Cache Behavior to Workloads”. In: Proceedings of the IEEE/ACM International Symposium on Microarchitecture. MICRO-39. IEEE Computer Society, Dec. 2006. doi: 10.1109/MICRO.2006.7 (cit. on pp. 3, 15, 18, 93).
[90] J. Abella, A. González, X. Vera, and M. F. P. O’Boyle. “IATAC: A Smart Predictor to Turn-off L2 Cache Lines”.
In: ACM Transactions on Architecture and Code Optimization (TACO)
[91] In: Journal of Systems Architecture
[92] In: SIAM International Conference on Data Mining. Apr. 2004. doi: 10.1137/1.9781611972740.43 (cit. on pp. 81, 100).
[93] C. Ding and Y. Zhong. “Predicting Whole-program Locality Through Reuse Distance Analysis”.
In: Proceedings of the ACM SIGPLAN Conference on Programming Language Design and Implementation. PLDI’03. Association for Computing Machinery, May 2003. doi: 10.1145/781131.781159 (cit. on p. 56).
[94] Z. Hu, S. Kaxiras, and M. Martonosi. “Let Caches Decay: Reducing Leakage Energy via Exploitation of Cache Generational Behavior”. In: ACM Transactions on Computer Systems
[95] In: Proceedings of the ACM SIGMETRICS International Conference on Measurement and Modeling of Computer Systems. SIGMETRICS’03. Association for Computing Machinery, June 2003. doi: 10.1145/781027.781076 (cit. on p. 42).
[96] M. Girvan and M. E. J. Newman. “Community structure in social and biological networks”.
In: The National Academy of Sciences
[97] In: International Symposium on Computer Architecture. ISCA’02. IEEE, May 2002. doi: 10.1109/ISCA.2002.1003579 (cit. on pp. 3, 15, 20, 31, 93).
[98] C. Kim, D. Burger, and S. W. Keckler. “An adaptive, non-uniform cache structure for wire-delay dominated on-chip caches”. In: Proceedings of the International Conference on Architectural Support for Programming Languages and Operating Systems. ASPLOS X. Association for Computing Machinery, Oct. 2002. doi: 10.1145/605397.605420 (cit. on p. 9).
[99] Z. Wang, K. S. McKinley, A. L. Rosenberg, and C. C. Weems. “Using the compiler to improve cache replacement decisions”. In: International Conference on Parallel Architectures and Compilation Techniques. PACT’02. IEEE Computer Society, Sept. 2002. doi: 10.1109/PACT.2002.1106018 (cit. on pp. 15, 23).
[100] P. Jain, S. Devadas, D. Engels, and L. Rudolph. “Software-assisted cache replacement mechanisms for embedded systems”. In: IEEE/ACM International Conference on Computer Aided Design. ICCAD’01. IEEE, Nov. 2001. doi: 10.1109/ICCAD.2001.968607 (cit. on pp. 15, 23).
[101] S. Kaxiras, Z. Hu, and M. Martonosi. “Cache decay: exploiting generational behavior to reduce cache leakage power”. In: International Symposium on Computer Architecture. ISCA’01. IEEE, June 2001. doi: 10.1109/ISCA.2001.937453 (cit. on p. 57).
[102] A.-C. Lai, C. Fide, and B. Falsafi. “Dead-block Prediction & Dead-block Correlating Prefetchers”. In: International Symposium on Computer Architecture. ISCA’01. Association for Computing Machinery, May 2001. doi: 10.1145/379240.379259 (cit. on pp. 20, 30, 57).
[103] D. Lee, J. Choi, J.-H. Kim, S. H. Noh, S. L. Min, Y. Cho, and C. S. Kim. “LRFU: a spectrum of policies that subsumes the least recently used and least frequently used policies”.
In: IEEE Transactions on Computers
[104] In: International Symposium on Computer Architecture. ISCA’00. IEEE, June 2000. doi: 10.1109/ISCA.2000.854385 (cit. on pp. 20, 30, 57).
[105] A.-L. Barabási and R. Albert. “Emergence of Scaling in Random Networks”. In: Science
[106] In: The Conference on Applications, Technologies, Architectures, and Protocols for Computer Communication. SIGCOMM’99. Association for Computing Machinery, Aug. 1999. doi: 10.1145/316188.316229 (cit. on pp. 5, 60).
[107] A. R. Lebeck, D. R. Raymond, C.-L. Yang, and M. Thottethodi. “Annotated Memory References: A Mechanism for Informed Cache Management”.
In: Euro-Par Conference on Parallel Processing. Euro-Par’99. Springer Berlin Heidelberg, Aug. 1999. doi: 10.1007/3-540-48311-X_177 (cit. on pp. 15, 23).
[108] L. Page, S. Brin, R. Motwani, and T. Winograd. The PageRank Citation Ranking: Bringing Order to the Web. Technical Report 1999-66. http://ilpubs.stanford.edu:8090/422. Stanford InfoLab, 1999 (cit. on p. 80).
[109] G. Karypis and V. Kumar. “Multilevel k-way Partitioning Scheme for Irregular Graphs”.
In: Journal of Parallel and Distributed Computing
[110] The Cache Memory Book. https://dl.acm.org/doi/10.5555/157953. Academic Press Professional, Inc., 1993. isbn: 0123229855 (cit. on pp. 3, 15, 16, 93).
[111] J. Banerjee, W. Kim, S. Kim, and J. F. Garza. “Clustering a DAG for CAD databases”. In: IEEE Transactions on Software Engineering
[112] In: IBM Systems Journal
[113] In: Proceedings of the National Conference. ACM’69. Association for Computing Machinery, Aug. 1969. doi: 10.1145/800195.805928 (cit. on p. 65).
[114] L. A. Belady. “A Study of Replacement Algorithms for a Virtual-storage Computer”. In: