Helper Without Threads: Customized Prefetching for Delinquent Irregular Loads
Karthik Sankaranarayanan, Chit-Kwan Lin and Gautham Chinya
Intel Corporation, 2111 NE 25th Ave, Hillsboro, OR 97124
E-mail: [email protected]
Abstract—The growing memory footprints of cloud and big data applications mean that data center CPUs can spend significant time waiting for memory. An attractive approach to improving performance in such centralized compute settings is to employ prefetchers that are customized per application, where gains can be easily scaled across thousands of machines. Helper thread prefetching is such a technique but has yet to achieve wide adoption since it requires spare thread contexts or special hardware/firmware support. In this paper, we propose an inline software prefetching technique that overcomes these restrictions by inserting the helper code into the main thread itself. Our approach is complementary to and does not interfere with existing hardware prefetchers since we target only delinquent irregular load instructions (those with no constant or striding address patterns). For each chosen load instruction, we generate and insert a customized software prefetcher extracted from and mimicking the application's dataflow, all without access to the application source code. For a set of irregular workloads that are memory-bound, we demonstrate up to 2X single-thread performance improvement on recent high-end hardware (Intel Skylake) and up to 83% speedup over a helper thread implementation on the same hardware, due to the absence of thread spawning overhead.
1 INTRODUCTION
The rise of the cloud and big data has caused the memory footprint of applications to grow faster than the pace of technology scaling (i.e., memory capacity and core counts). Moreover, as data parallel workloads increasingly move away from the CPU into GPUs, FPGAs and accelerators, the CPU is faced with a rise of irregular memory applications. As a result, data center CPUs can spend a significant fraction of execution cycles waiting for the caches [33]. Yet, despite the large core counts available, Amdahl's Law means that mitigating such single-thread performance bottlenecks remains crucial to achieving improved overall performance [26], [74].

Interestingly, the centralization of compute in the data center can be seen as an opportunity to be exploited. By customizing performance optimizations per application, gains can be scaled across many thousands of machines. This approach relies on obtaining intimate knowledge of an application's behavior through profiling and hardware performance counters [71], and using such information to extract optimal performance from the hardware for the target application.

Speculative precomputation [12], [13], [15], [16], [20], [21], [43], [59], [67], [75], otherwise known as helper threading [31], [32], [34], [35], [41], [42], [68], is such a technique. It reduces the single-thread latency of an application by using idle thread contexts in the hardware to spawn special-purpose, speculative threads called helper threads. Helper threads contain computation extracted from the main thread and consume the latency of execution on its behalf. They encounter cache misses and branch mispredictions ahead of the main thread, and act as execution-driven prefetchers or branch predictors for the main application, thereby improving its latency significantly.
Their benefit accrues from the fact that they are tailored to the specific application they are extracted from, and therefore orchestrate the hardware precisely to suit its needs.

However, helper threads can be tricky to implement efficiently. Although the original ideas appeared over two decades ago, we are not aware of any commercial processors with hardware support for helper threads today. Note that industry-strength compiler support is available for code generation of helper threads on multicore CPUs [1], [3]. However, the corresponding hardware support for generating low-overhead micro/nano threads (e.g., as in [12]) is absent. Hence, the dynamic thread spawn overhead is still significant in current operating systems. In the absence of such specialized hardware support, helper threads have two disadvantages today: (1) the need for spare thread contexts; and (2) the difficulty of synchronizing and matching the rates of the main application and the helper threads. In this work, we overcome both these limitations with an inline prefetching technique inspired by software pipelining [37], [56], yet our method retains an important benefit of helper threads: it works without access or modification to the application source code. This makes our technique attractive to cloud service providers who run third-party applications at scale.
Load instructions in a program can fall into three categories: (a) constant address, (b) striding, and (c) irregular. Constant address loads are loads whose virtual address does not change over multiple dynamic instances of the load (e.g., global variables and stack accesses). Striding loads are those with successive virtual addresses following an arithmetic progression (e.g., array accesses).
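To make the three categories concrete, here is a hypothetical C++ fragment of our own (the function and variable names are not from the paper) in which a single loop issues one load of each kind:

```cpp
#include <cstddef>

long g_counter = 0;  // (a) constant-address: &g_counter is the same on every iteration

// Hypothetical loop exhibiting all three load categories.
long sum_all(const long* a, const long* idx, std::size_t n) {
    long s = 0;
    for (std::size_t i = 0; i < n; ++i) {
        s += g_counter;   // (a) constant address load
        s += a[i];        // (b) striding load: addresses form an arithmetic progression
        s += a[idx[i]];   // (c) irregular load: the address depends on loaded data
    }
    return s;
}

// Small deterministic demo: 10 (strided) + 10 (irregular) + 0 = 20.
long demo() {
    static const long a[4]   = {1, 2, 3, 4};
    static const long idx[4] = {3, 0, 2, 1};
    return sum_all(a, idx, 4);
}
```

Hardware stride prefetchers handle (a) and (b) well; the third load is the kind this paper targets.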
Irregular loads are those which do not fall into either of the above two categories (e.g., indirect and pointer references). Furthermore, loads that frequently miss in the cache are said to be delinquent. While current hardware mechanisms are effective at prefetching regular address patterns [22], delinquent irregular loads (DILs) remain a challenge.

For a set of 165 traces from commercial applications identified by our framework as containing prefetchable DILs (see Section 4 for a formal definition), Figure 1 shows the fraction of CPU cycles spent stalled waiting for data at the time of retirement of the specific DILs.

Fig. 1: Fraction of CPU cycles spent waiting for data by specific memory-bound, prefetchable delinquent irregular loads (DILs).

The traces are from several client applications (productivity, games, content creation), server applications (cloud, database, enterprise, HPC) and CPU benchmarks simulated on a cycle-accurate simulator modeling the Intel Skylake [18] microarchitecture. On average, each trace has about three prefetchable DILs that are memory-bound. Should all that stall time be reduced to zero, the potential geometric mean speedup possible in these traces is 15%, which is significant headroom. However, these opportunities were identified from a universe of over 2000 traces, and building a prefetcher in hardware for such a narrow focus is not a profitable microarchitectural trade-off since such a prefetcher would be unused most of the time. On the other hand, large silicon and software companies currently employ significant software resources for manually optimizing select applications (e.g., Figure 5a is from a search engine). The scale of these high-value applications justifies the extra effort spent in optimizing them [9], [65].
As long as these applications fall into the right side of Figure 1, targeting their DILs through software is a better strategy than implementing a specialized prefetcher in hardware. Hence, we propose an inline software prefetching technique that selectively targets DILs that are memory-bound for prefetching.
[Figure 2 block diagram: load miss performance counters feed a Profiler (address delta profiling; delinquent memory-bound loads filter), which passes candidate regions to an Optimizer (dataflow analysis; cycle enumeration; prefetchable load identification), which emits generated code as custom prefetchers.]
Fig. 2: An overview of our customized prefetching approach, which requires no access to program source code.

Figure 2 summarizes our method. Prefetchers (either in hardware or in software) must meet accuracy, timeliness, and bandwidth criteria; specifically, an issued prefetch must be to the correct address just ahead of the actual demand, and must not throttle demand loads by consuming too much memory bandwidth. When these criteria are not met, the cache is polluted and performance suffers. Software prefetchers face an additional challenge of unintended interactions with hardware prefetchers [38]. We expressly designed our software prefetcher to avoid these complications by placing an emphasis on the selection of the load instructions that we prefetch: we only target loads that are extremely difficult for the hardware to prefetch, namely, memory-bound DILs. Through detailed profiling and dynamic dataflow analysis [36], our method identifies candidate memory-bound DILs that are part of inner loops and are likely to benefit from prefetching with minimal software prefetcher complexity. We call such DILs prefetchable and generate customized prefetching code for each. Similar to helper threads [34], [68], the prefetching code is extracted from the dataflow of the application itself. However, unlike helper threads (which require either spare thread contexts or special hardware support to run), our method inserts the prefetcher code into the application machine code directly. We find that the customized prefetching code sequence is usually small and its implementation overhead is negligible compared to the significant performance gain we achieve with improved prefetching.
This paper makes the following contributions:

• We overcome the limitations of helper threads with an inline prefetching scheme that does not require spare thread contexts or special hardware support;

• We eliminate the need for thread synchronization in helper prefetching by employing a statically-controlled prefetch distance and a prefetcher generation process inspired by software pipelining;

• For a set of irregular memory workloads, we demonstrate up to 2X end-to-end execution time speedup on current high-end hardware and up to 83% gain over helper threads.
2 RELATED WORK
A full discussion of the literature on prefetching is beyond the scope of this paper; the interested reader is referred to Falsafi and Wenisch [22] and Lee et al. [38] for thorough treatments of the hardware and software approaches to prefetching and the various challenges involved. Limiting our focus to irregular load prefetching, prior work can be classified into three main categories:
Microarchitectural techniques use on-chip storage to record patterns in the addresses of the irregular load and predict a future address to prefetch if the current address is from one of the recorded patterns [27], [30], [47], [49], [54], [57], [70], [72]. These approaches require large on-chip storage, the cost of which continues to preclude their commercial viability. The Indirect Memory Prefetcher (IMP) [72] is an exception: it uses very little on-chip storage by targeting specific indirect memory patterns of the form a[b[i]], where the array a is addressed by a striding feeder load b[i]. At runtime, IMP records in a hardware table the relationship between the striding load and the irregular load address and uses this to predict future addresses. The goals of our present work are to minimize prefetcher implementation complexity, as well as improve performance on current hardware. Hence, we choose the software implementation route. Moreover, as described in Section 3, our dataflow analysis framework is generic enough to handle more complex patterns such as a[f(b[i])] where f is any arbitrary function.

Computation-based techniques typically execute program instructions ahead of time to prefetch delinquent loads. While computation-based prefetching [25], [34], [48], [55] can be accurate, large runahead buffers or spare thread contexts for running helper threads are resource-intensive, especially considering their energy costs. Since regular loads are the overwhelming majority in general purpose applications, dedicating special hardware resources to handle comparatively rare events such as irregular loads may not represent a good microarchitectural trade-off. In contrast, a software implementation is more flexible and can be invoked to incur the cost only when the benefit is known to be greater. There are other limitations to computation-based methods beyond just the cost of implementation.
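The two access shapes can be illustrated with a sketch of our own (names are ours): the a[b[i]] pattern that IMP's table can track, versus a[f(b[i])] where the index passes through an arbitrary function, here a toy multiplicative hash:

```cpp
#include <cstddef>
#include <cstdint>

// a[b[i]]: the irregular load is fed by a striding load b[i] -- the exact
// pattern IMP's hardware table can track.
long gather_simple(const long* a, const int* b, std::size_t n) {
    long s = 0;
    for (std::size_t i = 0; i < n; ++i)
        s += a[b[i]];
    return s;
}

// a[f(b[i])]: the index passes through an arbitrary function f, which defeats
// pattern matching in hardware but is still a simple dataflow chain from the
// feeder load b[i] to the irregular load.
static std::size_t f(int x, std::size_t buckets) {
    return (static_cast<std::uint64_t>(x) * 2654435761ull) % buckets;
}

long gather_hashed(const long* a, const int* b, std::size_t n, std::size_t buckets) {
    long s = 0;
    for (std::size_t i = 0; i < n; ++i)
        s += a[f(b[i], buckets)];
    return s;
}

// Deterministic demo of the simple pattern: a[2] + a[0] + a[1] = 60.
long demo_simple() {
    static const long a[3] = {10, 20, 30};
    static const int  b[3] = {2, 0, 1};
    return gather_simple(a, b, 3);
}
```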
For example, runahead [48] is a technique that ignores a branch misprediction and continues execution to extract prefetching benefit from control-independent instructions. However, it is not designed to handle dependent misses (i.e., misses whose addresses are data-dependent on previous misses). Our approach handles dependent misses by prefetching the entire dependency chain through to the missing leaf instruction (see Section 3).

Helper threads [12], [13], [15], [16], [20], [21], [31], [32], [34], [35], [41], [42], [43], [59], [67], [68], [75] extract the backward slice of a delinquent load and run it on a spare thread context. When the latency of the backward slice is less than that of the original loop, the helper thread runs ahead of the main thread and prefetches memory accessed by the main thread into the cache. This technique has the advantage of being flexible enough to be implemented in hardware [12], [13], [15], [16], [20], [21], [23], [43], [59], [67], [75] or software [31], [32], [34], [35], [41], [42], [68]. It can also work in the absence of high-level source code and has been demonstrated in a compiler [35], a binary tool [41], and a dynamic optimizer [42]. However, all prior work in this area has either required spare thread contexts or special hardware/firmware support. Virtual Multithreading (VMT) [68] overcomes the need for spare thread contexts by partitioning the registers available to the compiler between the main and the helper computation. However, it still requires special yield instructions to orchestrate the transfer of control between the virtual threads and corresponding modifications to the processor firmware.
In our work, by choosing an inline implementation, we (1) avoid the need for any extra hardware or firmware support; (2) sidestep thread spawning overheads and synchronization bugs, since there are no threads to run; and (3) make straightforward the rate matching between the main computation and the prefetcher by statically setting the prefetch distance.

Similar to our approach, recent work [8] has proposed inserting prefetch hints based upon binary analysis. However, due to the absence of control flow analysis, it requires specialized hardware support in the form of speculative loads. Moreover, the benefits demonstrated are on top of a simulated microarchitecture without state-of-the-art hardware prefetchers such as [63], [72]. In contrast, we show benefit on existing hardware (not requiring special hardware support) and in comparison with state-of-the-art hardware prefetchers.
High-level language software techniques. Broadly speaking, software-based prefetchers are typically concerned with inserting prefetch hints [4] into a program or modifying its data structures. Most [6], [14], [38], [45], [58] rely on access to the program's source code. For instance, Roth and Sohi [58] augment the data layout of linked data structures with a jump pointer that acts as a prefetch pointer. Others [5], [6] have demonstrated significant speedup with programmable prefetching. However, many real-world situations preclude access to program source code, e.g., while using third-party libraries or when serving third-party applications in the cloud. In such situations, the ability to improve performance in the absence of source code is attractive. This work retains such capability from its lineage in helper threads. Finally, while our prefetcher generation is inspired by software pipelining [37], [56], [61], it is not a static instruction scheduling technique. Our target is performance improvement over dynamically scheduled out-of-order processors that hold multiple iterations of loops in their instruction window. The performance improvement is exclusively due to the duplication of code that stays a constant number of iterations ahead.
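As a concrete illustration of source-level prefetch-hint insertion, here is a minimal sketch of our own (the distance of 16 is an arbitrary choice for the example; __builtin_prefetch is the GCC/Clang intrinsic that lowers to a prefetch instruction):

```cpp
#include <cstddef>

constexpr std::size_t PF_DIST = 16;   // statically chosen prefetch distance

// Indirect gather with a prefetch hint issued PF_DIST iterations ahead.
long gather(const long* a, const int* idx, std::size_t n) {
    long s = 0;
    for (std::size_t i = 0; i < n; ++i) {
        if (i + PF_DIST < n)
            __builtin_prefetch(&a[idx[i + PF_DIST]]);  // hint only; no semantic effect
        s += a[idx[i]];
    }
    return s;
}

// The hint never changes the result: 6 + 8 + 5 + 7 = 26.
long demo() {
    static const long a[4]  = {5, 6, 7, 8};
    static const int idx[4] = {1, 3, 0, 2};
    return gather(a, idx, 4);
}
```

Note that even this source-level form presupposes access to the code being modified, which is exactly the restriction this paper avoids.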
3 MOTIVATING EXAMPLE
Let us now consider an example scenario where memory-bound DILs occur frequently. Hash tables are widely used because of their algorithmic efficiency in converting expected linear and logarithmic time operations into expected constant time operations [17]. For instance, they are used to implement associative arrays in popular scripting languages such as Python and R, and in relational databases for indexing. However, the underlying hash functions that generate hash table keys are designed specifically to disrupt data locality, i.e., they are designed to enforce irregular access. Thus, when a given hash table has too many unique keys to be held in on-chip caches, loading hash table entries can become a performance bottleneck.

typedef unsigned long UINT64;
typedef unordered_map
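A minimal sketch of such a histogram kernel, assuming the type names follow the visible typedefs above (the loop body is our reconstruction for illustration, not the paper's exact Listing 1):

```cpp
#include <unordered_map>
#include <vector>
#include <cstddef>

typedef unsigned long UINT64;
typedef std::unordered_map<UINT64, UINT64> HISTOGRAM;

// Hot loop: one hash-table lookup/insert per element. Once neither the input
// array nor the table fits in cache, the bucket-entry load becomes a
// memory-bound DIL.
void histogram(const std::vector<UINT64>& input, HISTOGRAM& hist) {
    for (std::size_t i = 0; i < input.size(); ++i)
        ++hist[input[i]];   // irregular: the bucket address is hash-dependent
}
```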
In the previous section, we explained the problem of memory-bound DILs through a hash table example and outlined the challenges in implementing a prefetcher with helper threads. Here, we will outline our approach to a solution, with a reminder that we want to create a prefetcher implementation without threads.

Observing the backward slice shown in Figure 3a, we see that the one cycle in the graph is comprised of a single instruction, i.e., the stride address increment, and that it is the only loop-carried dependence in this backward slice. Note that a cycle in the backward slice captures the essential relationship between the addresses produced by a DIL in successive iterations. If the instructions in the cycle can be executed efficiently by the hardware, then it becomes possible to overcome the bottlenecks outside the cycle through software prefetching. Conversely, if the cycle cannot be executed efficiently by the hardware due to true data dependencies, then the performance of such a DIL cannot be improved with software prefetching.

Specifically, if the backward slice of a DIL has no cycles with any irregular memory operations, then such a cycle can be executed efficiently by the hardware. Such a cycle can be run multiple iterations ahead of the main computation by a software prefetcher, and we describe DILs with such a backward slice as runnable. On the other hand, if the cycles in the backward slice have delinquent irregular memory operations, then running a few iterations ahead gives no advantage; the performance bottleneck would simply shift from the main computation to the prefetcher computation instead. This is the classic situation of pointer chasing and we refer to such DILs as chasing DILs.
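The distinction can be made concrete with two toy loops of our own (not taken from the paper's workloads):

```cpp
#include <cstddef>

// Runnable DIL: the only loop-carried cycle is the regular increment of i;
// the irregular load a[idx[i]] hangs off that cycle, so a duplicated slice
// can run k iterations ahead and prefetch its addresses.
long runnable(const long* a, const int* idx, std::size_t n) {
    long s = 0;
    for (std::size_t i = 0; i < n; ++i)
        s += a[idx[i]];
    return s;
}

struct Node { long val; Node* next; };

// Chasing DIL: the loop-carried cycle *is* the irregular load p = p->next,
// so a prefetcher running ahead merely inherits the same serialized misses.
long chasing(const Node* p) {
    long s = 0;
    for (; p != nullptr; p = p->next)
        s += p->val;
    return s;
}

// Deterministic demos: 30 + 30 + 10 = 70, and 1 + 2 + 3 = 6.
long demo_runnable() {
    static const long a[3]  = {10, 20, 30};
    static const int idx[3] = {2, 2, 0};
    return runnable(a, idx, 3);
}
long demo_chasing() {
    static Node c{3, nullptr}, b{2, &c}, a{1, &b};
    return chasing(&a);
}
```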
Short of moving the whole cycle of chasing computation closer to memory (through techniques such as processing in memory), not much can be done to improve such loads.

To explicitly contrast runnable and chasing DILs, we provide respective examples extracted from real applications in Figure 5.

(a) Disassembly. (b) Dataflow.

[Figure 3 legend: constant address loads; striding loads; other irregular loads; critical DIL; dataflow leading to the DIL (backward slice).]
Fig. 3: Disassembly and dataflow of the hot loop in Listing 1.
Fig. 4: Performance results from the helper thread implementation. The two plots on the left are from when the main and helper threads are restricted to SMT contexts in the same core. The two plots on the right are obtained by allowing the threads to be scheduled on any of the available cores.

The coloring scheme remains the same as in Figure 3. The backward slice of the runnable DIL shown in Figure 5a has three total cycles, but none have irregular memory operations. In contrast, the backward slice of the chasing DIL shown in Figure 5b has two cycles, one of which has an irregular memory operation (yellow).

Through dataflow analysis, we can determine if a DIL is runnable. However, merely being runnable does not guarantee that a DIL is also prefetchable. We must also examine the control flow within the loop. If the backward slice of a DIL varies along the different control flow paths through the loop, then the backward slice is control dependent on the branches within the loop. A popular example of such a situation occurs in an array-based implementation of a binary search tree. If the current search node is at index x, the next node to be searched can either be the left child (at index 2x + 1) or the right child (at index 2x + 2). If the prefetcher runs k iterations ahead, then there are 2^k possible addresses to prefetch. We have the option of either prefetching all of those addresses or implementing a software-based branch predictor to select one of the addresses to prefetch. Both of these options are unrealistic and hence we deliberately exclude such situations by considering only DILs that have backward slices that remain control independent of all the branches within the loop. Finally, when a DIL is runnable as well as control independent, we call it prefetchable. These two criteria comprise our DIL screen; our software prefetcher framework only targets DILs that pass this screen for custom prefetching code generation.

Once a prefetchable DIL has been identified, inspired by software pipelining [37], [56], we take a carrot and horse approach to prefetching it. We duplicate the backward slice code and assign new registers to it. By analogy, this code is the "carrot" and the main computation is the "horse". Prior to the entry into the loop, the carrot is first extended k iterations ahead of the horse. We call this phase in the dynamic execution the head start phase. After the entry into the loop, the carrot locks steps with the horse and stays a constant k iterations ahead. We call this phase in the dynamic execution the stay ahead phase. During the last k iterations of the loop, the carrot ceases to stay ahead and merges with the horse. We call this phase of dynamic execution the join phase. Finally, since the carrot overwrites the architectural registers, we also need to save them onto the stack at loop entry and restore them at loop exit.

This process is more formally described in Figure 6. The figure contrasts the dynamic instruction streams and the memory addresses accessed before and after the insertion of the software prefetcher code. At the top, each iteration in the original instruction stream has a DIL (marked DIL_i) that demands a particular memory address.

(a) A Runnable DIL.
(b) A Chasing DIL.

Fig. 5: Examples of a Runnable and a Chasing DIL. The runnable DIL has three cycles, but no irregular memory operations are part of these cycles. In contrast, the chasing DIL has two cycles and one of them has an irregular memory operation (yellow).

[Figure 6 diagram: the original instruction stream of DILs (DIL_1 ... DIL_n) versus the stream with custom prefetcher code (CP_i) interleaved, annotated with the save, head start, stay ahead, join, and restore phases across iterations 1, 2, ... k, k+1, ... n-k, ... n.]
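The head start, stay ahead, and join phases can be sketched at C++ level as follows (a simplification of our own with an array-index backward slice; the paper operates on machine code, not source, and K = 8 matches the later assembly example):

```cpp
#include <cstddef>

constexpr std::size_t K = 8;   // statically chosen prefetch distance

long sum_with_carrot(const long* a, const int* idx, std::size_t n) {
    // Head start: the duplicated slice ("carrot") begins K iterations ahead
    // of the main computation ("horse"), clamped for short loops.
    std::size_t pf = (K < n) ? K : (n ? n - 1 : 0);

    long s = 0;
    for (std::size_t i = 0; i < n; ++i) {
        __builtin_prefetch(&a[idx[pf]]);  // carrot: prefetch the future DIL address
        if (pf + 1 < n) ++pf;             // stay ahead: advance in lock step;
                                          // join: once pf reaches n-1 it stops
        s += a[idx[i]];                   // horse: the original delinquent load
    }
    return s;
}

// Prefetching is semantically transparent: 4 + 2 + 1 + 3 = 10.
long demo_carrot() {
    static const long a[4]  = {1, 2, 3, 4};
    static const int idx[4] = {3, 1, 0, 2};
    return sum_with_carrot(a, idx, 4);
}
```

At source level there are no registers to save and restore; that step only arises because the actual insertion happens in machine code, where the carrot needs registers of its own.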
Fig. 6: Overview of the phases in our prefetching scheme. DILs (marked DIL_i) at each loop iteration demand particular memory addresses. We insert customized prefetching code (yellow) that runs k iterations ahead to prefetch those addresses and mitigate the delinquency.

At the bottom of the figure, customized prefetching code (yellow) is inserted into the instruction stream. These are given a head start to run k iterations ahead such that the addresses they prefetch mitigate all the DILs within the stay ahead and join phases. Please note that although the carrot and horse approach sounds similar in principle to software pipelining, it is not an instruction scheduling technique and the speedups are exclusively because of the duplication of code that stays a constant number of iterations ahead.

With this overall picture in mind, we provide the details of our method next. The first step is the identification of DRAM-bound load instructions. For this purpose, we employ detailed profiling and dataflow analysis of the application of interest. Our analysis infrastructure uses a pintool [44] to generate the basic block vector profiles of the application at a 10M instruction granularity and the SimPoint [62] methodology to identify representative regions for microarchitectural simulation. We implement K-means clustering and augment it with silhouette analysis [60] to ensure clusters of good quality. We then use PinPlay [53] to generate two traces for each SimPoint, a short trace for functional simulation and dataflow analysis and a long trace for cycle-accurate microarchitectural analysis. The short functional simulation traces are 10M instructions long.
The long microarchitectural traces are 310M instructions long in order to accommodate in simulation a cache warmup period of 295M instructions, a microarchitectural warmup period of 5M instructions, and a detailed cycle-accurate simulation of 10M instructions.

Next, we perform cycle-accurate simulations of a microarchitecture resembling Intel's Skylake [18] CPU on an in-house x86_64 performance simulator. The cycle simulations produce a list of DRAM-bound load instructions, defined as those with an average CPI higher than the latency of the last level cache. This output list is sorted by the fraction of the total L1 data cache misses produced by each load instruction. We then select the delinquent load instructions covering the top 99% of all the L1 data cache misses for further dataflow analysis. It is worth noting that we chose this route of implementation through trace-level, cycle-accurate microarchitectural simulation, but there are other ways to identify DRAM-bound load instructions, e.g., with assistance from hardware performance counters [19], functional cache simulation [28], profiling DRAM accesses [10], [66] or even statically [52].

The next step in our analysis is to identify the irregular loads from the list of DRAM-bound loads. We achieve this through the calculation of address deltas, defined as the numerical difference between the addresses produced by successive executions of a load instruction. We compute the address delta histograms for all the DRAM-bound loads in the short traces. An n-dimensional regular array accessed inside a loop can produce n different address deltas. Hence, in order to filter out high-dimensional regular arrays common in numerical code, we choose a threshold of 10 deltas, i.e., we select only those load instructions with at least 10 distinct address deltas covering the top 90% of the executions.
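The address-delta screen just described can be sketched as follows (the thresholds of 10 deltas and 90% coverage are from the text; the function shape and names are our own):

```cpp
#include <algorithm>
#include <map>
#include <vector>
#include <cstdint>
#include <cstddef>

// Flag a load as irregular when at least 10 distinct address deltas are
// needed to cover the top 90% of its dynamic executions.
bool is_irregular(const std::vector<std::uint64_t>& addrs) {
    if (addrs.size() < 2) return false;

    std::map<std::int64_t, std::size_t> hist;   // delta -> occurrence count
    for (std::size_t i = 1; i < addrs.size(); ++i)
        ++hist[static_cast<std::int64_t>(addrs[i] - addrs[i - 1])];

    std::vector<std::size_t> counts;
    for (const auto& kv : hist) counts.push_back(kv.second);
    std::sort(counts.rbegin(), counts.rend());  // most frequent deltas first

    std::size_t total = addrs.size() - 1, covered = 0, needed = 0;
    for (std::size_t c : counts) {
        covered += c;
        ++needed;
        if (covered * 10 >= total * 9) break;   // reached 90% coverage
    }
    return needed >= 10;
}
```

A striding load has a single dominant delta and is filtered out immediately; a hash-dependent load needs many distinct deltas to cover its executions and passes the screen.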
This is our DIL candidate list.

We then build the dynamic control flow graph [7] using the short traces and determine the loop immediately encompassing each DIL. After that, we enumerate all the different control flow paths within the loop. For each such path, we perform dynamic dataflow analysis [36] to compute the backward slice graph and enumerate all the simple cycles in it [29]. With the information from the aforementioned address delta analysis, we find the cycles that do not involve any irregular memory operations and determine whether the DIL is runnable. When a runnable DIL has the same backward slice along all the control flow paths within its encompassing loop, we flag it as control independent and hence prefetchable. For these graph computations, we utilize the networkx [24] package in Python.

Once all the prefetchable DILs and their encompassing loops are identified, we group the DILs into loops and determine which among them inside the same loop produce addresses that are at a small constant offset from one another. We drop all such DILs from our list except the DIL with the largest average CPI (the critical DIL), since such addresses either fall within the same cache line as the critical DIL or regular hardware prefetchers will handle them properly. One of the load instructions in Figure 3 is an example of such a case. Moreover, to avoid alias analysis, we restrict ourselves to situations where the addresses of the stores in the backslice can be inferred statically. Through these successive screens, we are ultimately left with only those prefetchable DILs that are most challenging for the hardware to prefetch.

We now illustrate the generation of the customized prefetching code for the phases shown in Figure 6, using the hash table example from Figure 3. Keep in mind that we do not operate on the source code and hence begin with the loop shown in Figure 3a.
We insert the prefetcher assembly into the application's assembly directly. As a first step, we attempt to find unused architectural registers inside the loop. When there are no unused registers available, we create new local variables on the stack and select registers to spill onto them in the following order for minimal performance impact:

1) Registers only written to but never read from inside the loop (only the last write to these registers needs to be made visible outside the loop);

2) Registers only read from but never written to inside the loop (all references to these registers will be replaced by their corresponding stack loads).

For our example in Figure 3a, it turns out that registers r11, r14 and r15 are unused inside the loop. Among these, r11 is caller-saved and there is a function call inside the loop, meaning it could potentially be used inside the function call. Thus, we choose r14 and r15 as the registers to use for our carrot computation, i.e., inside the customized prefetching code. As discussed before and shown in Figure 6, the first step is to save these registers onto the stack:

    pushq %r14
    pushq %r15

Listing 2: The save phase.

Next is the head start phase, also performed at loop entry, where the prefetch computation gets a k-iteration head start. In our example, rbp is the only register written inside the cycle in the backward slice graph. Hence, we duplicate it onto r14. We also use r15 as a scratchpad to perform the loop boundary check by comparing it with the loop limit in rbx, as follows.

    .set k, 8
    .set logk, 3
    movq %rbp, %r14
    movq $k, %r15
    addq $0x1, %r15
    cmpq %rbx, %r15
    jge SKIP1
    movq $0x8, %r15
    shlq $logk, %r15
    addq %r15, %r14
SKIP1:
Listing 3: The head start phase.

The next two phases are (1) the stay ahead phase, where the prefetcher (carrot) computation stays ahead of and in lock step with the main (horse) computation, and (2) the join phase, where the prefetcher computation no longer stays ahead and ultimately merges with the main computation. Both of these phases are inserted into the loop body and are shown in Listing 4. For clarity: the inserted prefetcher code comprises the instructions that reference %r14 or %r15, the prefetcht0, and the SKIP labels; the remaining instructions are the original loop body.

START:
    movq (%r14),%r15
    movq (%rbp),%r9
    movq 0x8(%r12),%r8
    xorl %edx,%edx
    movq %r15,%rax
    divq %r8
    movq %rdx, %r15
    xorl %edx,%edx
    movq %r9,%rax
    divq %r8
    movq (%r12),%rax
    movq (%rax,%r15,8),%r15
    movq (%rax,%rdx,8),%rax
    movq %rdx,%r10
    testq %rax,%rax
    je LABEL1
    testq %r15, %r15
    je SKIP2
    movq (%r15),%r15
    prefetcht0 0x8(%r15)
SKIP2:
    movq (%rax),%rcx
    movq 0x8(%rcx),%rsi
    cmpq %rsi,%r9
    jne LABEL2
    movq %rbp,%rsi
    movq %r12,%rdi
    addq $0x1,%r13
    addq $0x8,%rbp
    addq $0x8, %r14
    movq %rbx, %r15
    subq $k, %r15
    cmpq %r15, %r13
    jl SKIP3
    movq %rbp, %r14
SKIP3:
    call 0xf60
    addq $0x1,(%rax)
    cmpq %r13,%rbx
    jne START

Listing 4: The stay ahead and join phases.

The last step is to restore the saved registers at all exit points of the loop.

    popq %r15
    popq %r14

Listing 5: The restore phase.

After the insertion of the prefetcher code, to ensure correctness, we compare the output of the optimized version to that of the unoptimized version and require that they match exactly, except for those outputs dependent on operating system behavior such as timing measurement, random number generation, signal handling, etc.

5 EXPERIMENTAL EVALUATION
Recall from Figure 1 that while DIL prefetching may not benefit all applications, some irregular applications can benefit substantially (right side of Figure 1). For instance, several high-value cloud applications fall into this category. Hence, we evaluate our proposal on a set of irregular memory workloads similar to those in the work by Ainsworth and Jones [6] (we do not use the applications from Figure 1 since we do not have access to their binaries). We study three applications from their work that are bottlenecked by DRAM-bound DILs and add two more to the evaluation, including the hash table example we discussed in Sections 3 and 4. Since our focus is on single-thread performance, we use the serial versions of the benchmarks for experimentation. We compile all benchmarks with gcc 6.1.0 using the flags -O3 -march=native on an Intel Xeon E5 server CPU. We run all the analysis tools for prefetchable DIL identification and generate the customized prefetching code on the same server as well.
We now provide a brief overview of the applications studied.
• STLHistogram is the example we discussed in Sections 3 and 4. It generates a random array of integers and computes the frequency histogram of the array using a C++ STL unordered_map. It takes the size of the array and the number of unique elements in it as arguments. The microarchitectural performance of this application suffers when neither the input array nor the frequency histogram fits inside the on-chip caches. We chose this benchmark due to the popularity of hash tables in programs and the potential for customized prefetching to improve performance. Note that since open-address hash tables are also popular, we additionally studied a policy-based implementation of STLHistogram. While the baseline performance of this new version was 7X better than the unordered_map version, the performance improvement opportunity was very similar, with a single prefetchable memory-bound DIL causing most of the stalls. Hence, we report results only for the unordered_map version.
• PageRank is an implementation of the popular web-page relevance ranking algorithm [50] using the C++ Boost Graph Library [2] (BGL). It is a graph algorithm that ranks a website based upon the ranks of the websites that link to it.
• HashJoin [11] from the University of Wisconsin implements the join operation of a relational database [64] in main memory using hash tables. The join operation is very common in Structured Query Language (SQL) queries.
• Graph500CSR is part of the Graph500 [46] benchmark suite designed to rate supercomputer systems on their data-intensive performance. It performs Breadth-First Search (BFS) on a large graph implemented using a compressed sparse rows (CSR) data structure.
• Cuckoo [73] is an application modeling packet processing in the context of Network Function Virtualization (NFV) using the cuckoo hashing algorithm [51].
We run the sequential versions of these applications on the inputs shown in Table 1 and generate traces as discussed in Section 4.1. An automatic tool analyzes the traces to produce the list of prefetchable DILs, the loops they belong to, and a list of available registers for code generation. The customized prefetching code is then generated semi-automatically with manual intervention. Specifically, our scripts generate a skeletal prefetcher code with the duplicated backward slice and a list of candidate registers. However, register fills/spills, null-pointer skips, and the handling of slices across function calls are done manually. Another automatic tool then statically rewrites the original function in the binary with a dynamic version that allocates the optimized code in the heap and calls it through a function pointer. We then run the optimized binary to ensure that its output matches the original. For performance measurement, we employ an Intel Core i9-7900X Skylake CPU with all the hardware prefetchers enabled, running at 3.3 GHz with frequency scaling disabled in the BIOS. We choose an evaluation system that is different from the one used for compilation to simulate a binary-only scenario. We run the applications five times each and record the median wall clock time before and after optimization. We also measure the dynamic instruction overhead of the optimized versions using a pintool [44]. The last column of Table 1 shows the dynamic instruction counts of the main computation in the original applications.
Benchmark            Input                             Dynamic Instr (B)
STLHistogram         100M array, 10M unique elements   7.9
PageRank [2], [50]   web-Google.txt [40]               1.1
HashJoin [11]        016M build.tbl, 256M probe.tbl    55.8
Graph500CSR [46]     -s 18 -e 10                       11.2
Cuckoo [51], [73]    8M flows                          10.2

TABLE 1: Benchmarks and inputs (Input 1).
First, we present the results of the control and dataflow analyses for the applications.
Benchmark      DILs   Prefetchable DILs   Loops   Function Name
STLHistogram   4      3                   1       gen_histo
PageRank       4      4                   2       pagerank
HashJoin       2      2                   1       realprobeCursor
Graph500CSR    6      6                   2       make_bfs_tree, verify_bfs_tree
Cuckoo         3      2                   1       rte_hash_lookup_bulk_data

TABLE 2: Results of the control and dataflow analyses.

The data in Table 2 shows that of the 19 total DILs, 17 are prefetchable. We proceed with the performance evaluation of the prefetchers for these DILs.
For the five applications, we vary the prefetch distance from two iterations to 256 iterations in powers of two. Note that we choose powers of two only for a minor convenience in code generation, since multiplication can be replaced with shifts; it is not a fundamental restriction of our approach and can easily be changed to accommodate any arbitrary lookahead. We then verify that the outputs of the optimized binaries match those of the original ones, and then measure the performance of the optimized versions. The speedup from the performance optimization is shown in Figure 7a. The corresponding dynamic instruction overhead is shown in Figure 7b. The x-axis on both figures is the prefetch distance, which is the number of iterations of lookahead available for the prefetcher. The y-axis in Figure 7a is the ratio of the median wall clock time before optimization to that after. Figure 7b plots on its y-axis the ratio of the total dynamic instructions of the baseline to that of the optimized executions. Note that the speedups reported include the dynamic instruction overhead since we measure wall clock time.

Fig. 7: Performance of the prefetching scheme (Input 1). (a) Speedup. (b) Dynamic instruction overhead.
For the applications and inputs described in Table 1, there is a significant speedup of 21-94% due to our software prefetchers. This speedup comes in spite of significant dynamic instruction overhead in some cases. Hence, this result clearly demonstrates that we are successful in accurately prefetching the critical load addresses in a manner that does not interfere with the memory bandwidth or with any hardware prefetchers.

A pattern to observe in the data is that even with only a few iterations of prefetch-distance lookahead, the performance increases significantly. In fact, except for PageRank and Cuckoo, the performance improvement is stable across the entire range of prefetch distances. This is because the loop sizes are such that only a few iterations fit in the dynamic instruction window of the CPU. Hence, even with a small lookahead, the prefetcher reaches outside the instruction window and is effective. However, the behaviors of PageRank and Cuckoo deserve further explanation.

PageRank operates on the Web-Google graph dataset [40], which has an average degree of less than six. The inner loop encompassing the prefetchable DIL iterates over all the neighbors of a graph node. Hence, the trip count of this loop is equal to the average number of a node's neighbors, i.e., its average degree. Therefore, prefetch distances longer than six skip the loop entirely and do not help much. This behavior can also be seen in the dynamic instruction overhead data in Figure 7b. A similar behavior occurs in Cuckoo as well, where the prefetchable DILs are from an inner loop with a fixed iteration count of 32. The lost opportunity due to small iteration counts is the reason for the reduction in performance with increasing lookahead.

Note that the performance gains for STLHistogram and HashJoin are much higher than those for the remaining three. In the former two, the critical DIL is fed by a strided load after passing through a hash function and multiple indirections.
However, in the latter three, the strided load feeds the DIL directly through fewer indirections (and a hash function in Cuckoo). Thus, as discussed in Sections 3 and 4, the bottleneck of the chain of dependent cache misses is much larger for the former applications than the latter. Consequently, the performance boost obtained by mitigating them is also higher.

5.2.2.1 Impact of Inputs: Next, we select a set of larger inputs for our applications and run the optimized binaries on this set to study the sensitivity to different application inputs. Table 3 lists the new inputs used for this experiment.
Benchmark      Input                             Dynamic Instr (B)
STLHistogram   200M array, 10M unique elements   12.7
PageRank       cit-Patents.txt [39]              4.2
HashJoin       032M build.tbl, 512M probe.tbl    148.2
Graph500CSR    -s 21 -e 10                       90.7
Cuckoo         16M flows                         20.5

TABLE 3: Alternative inputs for the optimized benchmarks (Input 2).

Figure 8 displays the speedup and the dynamic instruction overhead for the optimized binaries running on these new inputs. We can see that the speedup has improved for STLHistogram, stayed about the same for HashJoin and Cuckoo, and decreased for PageRank and Graph500CSR. Overall, the speedups range from 10% to 100% and are still significant over the baselines.

Fig. 8: Prefetcher performance on different input data (Input 2). (a) Speedup. (b) Dynamic instruction overhead.

For PageRank, the cit-Patents dataset [39] has an average degree of about 4, lower than that of the previous Web-Google dataset. Thus, as discussed earlier, the drop in its speedup can be attributed to the reduced trip count of its inner loop. As for Graph500CSR, the new input has a higher number of vertices but the same average degree as before, and the performance contribution of the DILs is lower than before. Hence, the corresponding speedup from prefetching them is also lower.

5.2.2.2 Impact of Microarchitecture: The results shown so far were for a single microarchitecture. To study the impact of a different microarchitecture, we generate traces from the unoptimized and optimized binaries and perform cycle-accurate simulations on them for an aggressive hypothetical microarchitecture that is 2X wider and 3.5X deeper than Skylake. We also model two aggressive hardware prefetchers similar to VLDP [63] and IMP [72], since they were published after the release of the Skylake microarchitecture. Figure 9 shows the result of the experiment. Unlike Skylake, the dynamic instruction window of the hypothetical microarchitecture can hold many more iterations of the loops.
Hence, short prefetch distances do not reach beyond the instruction window. This is why the speedups are lower for shorter lookaheads (for the benchmarks without small loop iteration counts). However, once the prefetch distances are sufficient to look beyond the instruction window, the speedups stabilize. The extra latency hiding offered by the 3X increase in out-of-order depth causes the DILs from PageRank to no longer be performance bottlenecks. Hence, the instruction overhead for that benchmark shows up as a slowdown in the chart. Nevertheless, the speedups continue to be significant overall (the bar for STLHistogram is missing for the prefetch distance of 128 due to a simulation failure).

Fig. 9: Prefetcher performance on a hypothetical microarchitecture that is 2X wider and 3.5X deeper than Skylake and includes aggressive prefetchers similar to VLDP [63] and IMP [72].

The stability of the speedups across prefetch distances beyond a particular threshold is helpful in the case of variable DRAM latencies. Setting the lookahead for the worst-case memory latency can provide speedups that are robust to the variability. Moreover, the fact that the speedups remain significant even under contemporary aggressive hardware prefetchers emphasizes that our approach is complementary to the hardware and minimizes interference.

5.2.2.3 Comparison with Helper Threads: We now compare the inline prefetcher to traditional helper threads. To provide the techniques with equal hardware, we restrict the helper thread implementations to one additional SMT context from the same core as the main thread. We also select the best tuning parameters (prefetch distance for the inline prefetcher and launch trigger/frequency for the helper threads) for both schemes. Figure 10 shows the results of the experiment.

Our inline prefetcher outperforms helper threads due to the latter's thread-spawning overhead. Dropping the outlier (Cuckoo), the speedups range from 13% to 83%, which is significant. For Cuckoo, the number of thread spawns is prohibitive for helper threading to be competitive. From the results of these experiments, we conclude that the proposed prefetcher scheme is accurate in targeting the critical load instructions and improves the single-thread performance of the targeted applications significantly. It does so without the requirements of traditional helper threading, such as idle thread contexts and special support from hardware or firmware.
As a binary modification technique, debuggability can be affected by the optimization. Hence, it is a good idea to restrict the optimization only to performance-critical code.

Fig. 10: Percent improvement of the inline prefetcher over helper threads.

As a prefetching scheme running on the CPU, we drop all pointer-chasing loads from the purview of our optimization. Such a restriction is not essential. The backward slices and cycles with chasing loads are ideal for offload into Processing-In-Memory (PIM). Future work could explore means of implementing such offloading. Also, we have restricted ourselves to a software implementation on existing hardware, which is not mandatory. The profile-based, offline dataflow analysis could advise hardware-software co-design, and the prefetchers could be implemented in custom hardware instead. With the advent of Field-Programmable Gate Arrays (FPGAs), custom hardware prefetchers closely coupled with a processor's pipeline are another potential direction of investigation.
CONCLUSION
In this paper, we have described an inline software prefetcher for DRAM-bound Delinquent Irregular Loads (DILs). To avoid interfering with the hardware prefetchers for regular loads and to keep the bandwidth impact and cache pollution to a minimum, we have designed the scheme to be highly selective, targeting only the DILs most difficult for the hardware to prefetch. In spite of being selective, our approach has significant potential for performance enhancement, as demonstrated by four applications from different domains: a C++ hash table implementation, the PageRank website ranking algorithm, a database join algorithm, and the Graph500 breadth-first search of a large graph. Across all inputs to the test applications, the speedup due to our inline prefetchers ranged from 10% to 100% on a high-end Intel Skylake system. Our approach performs better than a traditional implementation of helper threads due to the latter's thread-spawning overhead. It does so while still not requiring separate thread contexts or special hardware/firmware support. It makes the implementation and debugging of the helper easier, since it avoids explicit synchronization and stays a constant number of iterations ahead of the main computation. As a software approach that does not require high-level source code, it can be attractive for third-party cloud applications.

REFERENCES
[2] The Boost Graph Library: User Guide and Reference Manual. Boston, MA, USA: Addison-Wesley Longman Publishing Co., Inc., 2002.
[3] "Intel C++ Compiler Professional Edition 11.1 for Linux Installation Guide and Release Notes," https://software.intel.com/content/dam/develop/external/us/en/documents/release-notesc-l-en-779487.pdf, Sep. 2009, [Online; accessed 18-June-2020].
[4] A.-R. Adl-Tabatabai, R. L. Hudson, M. J. Serrano, and S. Subramoney, "Prefetch injection based on hardware monitoring and object metadata," in Proceedings of the ACM SIGPLAN 2004 Conference on Programming Language Design and Implementation, 2004.
[5] S. Ainsworth and T. M. Jones, "Software prefetching for indirect memory accesses," in Proceedings of the 2017 International Symposium on Code Generation and Optimization, 2017.
[6] ——, "An event-triggered programmable prefetcher for irregular workloads," in Proceedings of the Twenty-Third International Conference on Architectural Support for Programming Languages and Operating Systems, 2018.
[7] F. E. Allen, "Control flow analysis," in Proceedings of a Symposium on Compiler Optimization, 1970.
[8] G. Ayers, H. Litz, C. Kozyrakis, and P. Ranganathan, "Classifying memory access patterns for prefetching," in Proceedings of the Twenty-Fifth International Conference on Architectural Support for Programming Languages and Operating Systems, 2020.
[9] G. Ayers, N. P. Nagendra, D. I. August, H. K. Cho, S. Kanev, C. Kozyrakis, T. Krishnamurthy, H. Litz, T. Moseley, and P. Ranganathan, "AsmDB: Understanding and mitigating front-end stalls in warehouse-scale computers," in Proceedings of the 46th International Symposium on Computer Architecture, 2019.
[10] Y. Bao, M. Chen, Y. Ruan, L. Liu, J. Fan, Q. Yuan, B. Song, and J. Xu, "HMTT: A platform independent full-system memory trace monitoring system," SIGMETRICS Perform. Eval. Rev., vol. 36, no. 1, Jun. 2008.
[11] S. Blanas, Y. Li, and J. M. Patel, "Design and evaluation of main memory hash join algorithms for multi-core CPUs," in Proceedings of the 2011 ACM SIGMOD International Conference on Management of Data, ∼jignesh/multijoin.tar.bz2.
[12] R. S. Chappell, J. Stark, S. P. Kim, S. K. Reinhardt, and Y. N. Patt, "Simultaneous subordinate microthreading (SSMT)," in Proceedings of the 26th Annual International Symposium on Computer Architecture, 1999.
[13] R. S. Chappell, F. Tseng, A. Yoaz, and Y. N. Patt, "Microarchitectural support for precomputation microthreads," in Proceedings of the 35th Annual ACM/IEEE International Symposium on Microarchitecture, 2002.
[14] T. M. Chilimbi and M. Hirzel, "Dynamic hot data stream prefetching for general-purpose programs," in Proceedings of the ACM SIGPLAN 2002 Conference on Programming Language Design and Implementation, 2002.
[15] J. D. Collins, D. M. Tullsen, H. Wang, and J. P. Shen, "Dynamic speculative precomputation," in Proceedings of the 34th Annual ACM/IEEE International Symposium on Microarchitecture, 2001.
[16] J. D. Collins, H. Wang, D. M. Tullsen, C. Hughes, Y.-F. Lee, D. Lavery, and J. P. Shen, "Speculative precomputation: Long-range prefetching of delinquent loads," in Proceedings of the 28th Annual International Symposium on Computer Architecture, 2001.
[17] T. H. Cormen, C. E. Leiserson, R. L. Rivest, and C. Stein, Introduction to Algorithms.
Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, 2016.
[20] M. Dubois, "Fighting the memory wall with assisted execution," in Proceedings of the 1st Conference on Computing Frontiers, 2004.
[21] M. Dubois and Y. Song, "Assisted execution," University of Southern California CENG Technical Report, vol. 98, no. 25, 1998.
[22] B. Falsafi and T. F. Wenisch, A Primer on Hardware Prefetching. Morgan & Claypool Publishers, 2014.
[23] A. Garg and M. C. Huang, "A performance-correctness explicitly-decoupled architecture," in Proceedings of the 41st Annual IEEE/ACM International Symposium on Microarchitecture, 2008.
[24] A. A. Hagberg, D. A. Schult, and P. J. Swart, "Exploring network structure, dynamics, and function using NetworkX," in Proceedings of the 7th Python in Science Conference, 2008.
[25] M. Hashemi, O. Mutlu, and Y. N. Patt, "Continuous runahead: Transparent hardware acceleration for memory intensive workloads," in The 49th Annual IEEE/ACM International Symposium on Microarchitecture, 2016.
[26] U. Holzle, "Brawny cores still beat wimpy cores, most of the time," IEEE Micro, 2010.
[27] A. Jain and C. Lin, "Linearizing irregular memory accesses for improved correlated prefetching," in Proceedings of the 46th Annual IEEE/ACM International Symposium on Microarchitecture.
SIAM Journal on Computing, vol. 4, no. 1, 1975.
[30] D. Joseph and D. Grunwald, "Prefetching using Markov predictors," in Proceedings of the 24th Annual International Symposium on Computer Architecture, 1997.
[31] C. Jung, D. Lim, J. Lee, and Y. Solihin, "Helper thread prefetching for loosely-coupled multiprocessor systems," in Proceedings of the 20th IEEE International Parallel & Distributed Processing Symposium, 2006.
[32] M. Kamruzzaman, S. Swanson, and D. M. Tullsen, "Inter-core prefetching for multicore processors using migrating helper threads," in Proceedings of the Sixteenth International Conference on Architectural Support for Programming Languages and Operating Systems, 2011.
[33] S. Kanev, J. P. Darago, K. Hazelwood, P. Ranganathan, T. Moseley, G.-Y. Wei, and D. Brooks, "Profiling a warehouse-scale computer," in Proceedings of the 42nd Annual International Symposium on Computer Architecture, 2015.
[34] D. Kim, S. S.-W. Liao, P. H. Wang, J. del Cuvillo, X. Tian, X. Zou, H. Wang, D. Yeung, M. Girkar, and J. P. Shen, "Physical experimentation with prefetching helper threads on Intel's hyper-threaded processors," in Proceedings of the International Symposium on Code Generation and Optimization: Feedback-directed and Runtime Optimization, 2004.
[35] D. Kim and D. Yeung, "Design and evaluation of compiler algorithms for pre-execution," in Proceedings of the 10th International Conference on Architectural Support for Programming Languages and Operating Systems, 2002.
[36] B. Korel and J. Laski, "Dynamic program slicing," Information Processing Letters, vol. 29, no. 3, Oct. 1988.
[37] M. Lam, "Software pipelining: An effective scheduling technique for VLIW machines," in Proceedings of the ACM SIGPLAN 1988 Conference on Programming Language Design and Implementation, 1988.
[38] J. Lee, H. Kim, and R. Vuduc, "When prefetching works, when it doesn't, and why," ACM Transactions on Architecture and Code Optimization, vol. 9, no. 1, Mar. 2012.
[39] J. Leskovec, J. Kleinberg, and C. Faloutsos, "Graphs over time: Densification laws, shrinking diameters and possible explanations," in Proceedings of the Eleventh ACM SIGKDD International Conference on Knowledge Discovery in Data Mining, 2005, http://snap.stanford.edu/data/cit-Patents.html.
[40] J. Leskovec, K. J. Lang, A. Dasgupta, and M. W. Mahoney, "Community structure in large networks: Natural cluster sizes and the absence of large well-defined clusters," Internet Mathematics, vol. 6, no. 1, Jan. 2009, http://snap.stanford.edu/data/web-Google.html.
[41] S. S. Liao, P. H. Wang, H. Wang, G. Hoflehner, D. Lavery, and J. P. Shen, "Post-pass binary adaptation for software-based speculative precomputation," in Proceedings of the ACM SIGPLAN 2002 Conference on Programming Language Design and Implementation, 2002.
[42] J. Lu, A. Das, W.-C. Hsu, K. Nguyen, and S. G. Abraham, "Dynamic helper threaded prefetching on the Sun UltraSPARC CMP processor," in Proceedings of the 38th Annual IEEE/ACM International Symposium on Microarchitecture, 2005.
[43] C.-K. Luk, "Tolerating memory latency through software-controlled pre-execution in simultaneous multithreading processors," in Proceedings of the 28th Annual International Symposium on Computer Architecture, 2001.
[44] C.-K. Luk, R. Cohn, R. Muth, H. Patil, A. Klauser, G. Lowney, S. Wallace, V. J. Reddi, and K. Hazelwood, "Pin: Building customized program analysis tools with dynamic instrumentation," in Proceedings of the 2005 ACM SIGPLAN Conference on Programming Language Design and Implementation, 2005.
[45] C.-K. Luk and T. C. Mowry, "Compiler-based prefetching for recursive data structures," in Proceedings of the Seventh International Conference on Architectural Support for Programming Languages and Operating Systems, 1996.
[46] R. C. Murphy, K. B. Wheeler, B. W. Barrett, and J. A. Ang, "Introducing the Graph 500," Cray User's Group, May 2010.
[47] O. Mutlu, H. Kim, and Y. N. Patt, "Address-value delta (AVD) prediction: Increasing the effectiveness of runahead execution by exploiting regular memory allocation patterns," 2005.
[48] O. Mutlu, H. Kim, and Y. N. Patt, "Techniques for efficient processing in runahead execution engines," in Proceedings of the 32nd Annual International Symposium on Computer Architecture, 2005.
[49] K. J. Nesbit and J. E. Smith, "Data cache prefetching using a global history buffer," in Proceedings of the 10th International Symposium on High Performance Computer Architecture, 2004.
[50] L. Page, S. Brin, R. Motwani, and T. Winograd, "The PageRank citation ranking: Bringing order to the web," Tech. Rep. 1999-66, Nov. 1999.
[51] R. Pagh and F. F. Rodler, "Cuckoo hashing," J. Algorithms, vol. 51, no. 2, May 2004.
[52] V.-M. Panait, A. Sasturkar, and W.-F. Wong, "Static identification of delinquent loads," in Proceedings of the International Symposium on Code Generation and Optimization: Feedback-directed and Runtime Optimization, 2004.
[53] H. Patil, C. Pereira, M. Stallcup, G. Lueck, and J. Cownie, "PinPlay: A framework for deterministic replay and reproducible analysis of parallel programs," in Proceedings of the 8th Annual IEEE/ACM International Symposium on Code Generation and Optimization, 2010.
[54] L. Peled, S. Mannor, U. Weiser, and Y. Etsion, "Semantic locality and context-based prefetching using reinforcement learning," in Proceedings of the 42nd Annual International Symposium on Computer Architecture, 2015.
[55] Z. Purser, K. Sundaramoorthy, and E. Rotenberg, "A study of slipstream processors," in Proceedings of the 33rd Annual ACM/IEEE International Symposium on Microarchitecture, 2000.
[56] B. R. Rau and C. D. Glaeser, "Some scheduling techniques and an easily schedulable horizontal architecture for high performance scientific computing," in Proceedings of the 14th Annual Workshop on Microprogramming, 1981.
[57] A. Roth, A. Moshovos, and G. S. Sohi, "Dependence based prefetching for linked data structures," in Proceedings of the Eighth International Conference on Architectural Support for Programming Languages and Operating Systems, 1998.
[58] A. Roth and G. S. Sohi, "Effective jump-pointer prefetching for linked data structures," in Proceedings of the 26th Annual International Symposium on Computer Architecture, 1999.
[59] ——, "Speculative data-driven multithreading," in Proceedings of the Seventh International Symposium on High-Performance Computer Architecture, 2001.
[60] P. Rousseeuw, "Silhouettes: A graphical aid to the interpretation and validation of cluster analysis," Journal of Computational and Applied Mathematics, vol. 20, no. 1, Nov. 1987.
[61] F. J. Sanchez and A. Gonzalez, "Cache sensitive modulo scheduling," in Proceedings of the 30th Annual ACM/IEEE International Symposium on Microarchitecture, 1997.
[62] T. Sherwood, E. Perelman, G. Hamerly, and B. Calder, "Automatically characterizing large scale program behavior," in Proceedings of the 10th International Conference on Architectural Support for Programming Languages and Operating Systems, 2002.
[63] M. Shevgoor, S. Koladiya, R. Balasubramonian, C. Wilkerson, S. H. Pugsley, and Z. Chishti, "Efficiently prefetching complex address patterns," in Proceedings of the 48th International Symposium on Microarchitecture, 2015.
[64] A. Silberschatz, H. Korth, and S. Sudarshan, Database System Concepts, 6th ed. New York, NY, USA: McGraw-Hill, Inc., 2010.
[65] A. Sriraman, A. Dhanotia, and T. F. Wenisch, "SoftSKU: Optimizing server architectures for microservice diversity @scale," in Proceedings of the 46th International Symposium on Computer Architecture, 2019.
[66] X.-H. Sun and D. Wang, "Concurrent average memory access time," Computer, vol. 47, no. 5, May 2014.
[67] D. M. Tullsen and J. A. Brown, "Handling long-latency loads in a simultaneous multithreading processor," in Proceedings of the 34th Annual ACM/IEEE International Symposium on Microarchitecture, 2001.
[68] P. H. Wang, J. D. Collins, H. Wang, D. Kim, B. Greene, K.-M. Chan, A. B. Yunus, T. Sych, S. F. Moore, and J. P. Shen, "Helper threads via virtual multithreading on an experimental Itanium® 2 processor-based platform," in Proceedings of the 11th International Conference on Architectural Support for Programming Languages and Operating Systems, 2004.
[69] C. Wellons, "Raw Linux threads via system calls," https://nullprogram.com/blog/2015/05/15/, 2015, [Online; accessed 05-August-2019].
[70] H. Wu, K. Nathella, J. Pusdesris, D. Sunwoo, A. Jain, and C. Lin, "Temporal prefetching without the off-chip metadata," in Proceedings of the 52nd Annual IEEE/ACM International Symposium on Microarchitecture, 2019.
[71] A. Yasin, "A top-down method for performance analysis and counters architecture," 2014.
[72] X. Yu, C. J. Hughes, N. Satish, and S. Devadas, "IMP: Indirect memory prefetcher," in Proceedings of the 48th International Symposium on Microarchitecture, 2015.
[73] Y. Yuan, Y. Wang, R. Wang, and J. Huang, "HALO: Accelerating flow classification for scalable packet processing in NFV," in Proceedings of the 46th International Symposium on Computer Architecture, 2019.
[74] Y. Zhang, D. Meisner, J. Mars, and L. Tang, "Treadmill: Attributing the source of tail latency through precise load testing and statistical inference," in Proceedings of the 43rd International Symposium on Computer Architecture, 2016.
[75] C. Zilles and G. Sohi, "Execution-based prediction using speculative slices," in Proceedings of the 28th Annual International Symposium on Computer Architecture, 2001.