Helper Without Threads: Customized Prefetching for Delinquent Irregular Loads
Karthik Sankaranarayanan, Chit-Kwan Lin and Gautham Chinya
Intel Corporation, 2111 NE 25th Ave, Hillsboro, OR 97124
E-mail: [email protected]
Abstract—The growing memory footprints of cloud and big data applications mean that data center CPUs can spend significant time waiting for memory. An attractive approach to improving performance in such centralized compute settings is to employ prefetchers that are customized per application, where gains can be easily scaled across thousands of machines. Helper thread prefetching is such a technique but has yet to achieve wide adoption since it requires spare thread contexts or special hardware/firmware support. In this paper, we propose an inline software prefetching technique that overcomes these restrictions by inserting the helper code into the main thread itself. Our approach is complementary to and does not interfere with existing hardware prefetchers since we target only delinquent irregular load instructions (those with no constant or striding address patterns). For each chosen load instruction, we generate and insert a customized software prefetcher extracted from and mimicking the application's dataflow, all without access to the application source code. For a set of irregular workloads that are memory-bound, we demonstrate up to 2X single-thread performance improvement on recent high-end hardware (Intel Skylake) and up to 83% speedup over a helper thread implementation on the same hardware, due to the absence of thread spawning overhead.
1 INTRODUCTION
The rise of the cloud and big data has caused the memory footprint of applications to grow faster than the pace of technology scaling (i.e., memory capacity and core counts). Moreover, as data parallel workloads increasingly move away from the CPU into GPUs, FPGAs and accelerators, the CPU is faced with a rise of irregular memory applications. As a result, data center CPUs can spend a significant fraction of execution cycles waiting for the caches [33]. Yet, despite the large core counts available, Amdahl's Law means that mitigating such single-thread performance bottlenecks remains crucial to achieving improved overall performance [26], [74].

Interestingly, the centralization of compute in the data center can be seen as an opportunity to be exploited. By customizing performance optimizations per application, gains can be scaled across many thousands of machines. This approach relies on obtaining intimate knowledge of an application's behavior through profiling and hardware performance counters [71], and using such information to extract optimal performance from the hardware for the target application.

Speculative precomputation [12], [13], [15], [16], [20], [21], [43], [59], [67], [75], otherwise known as helper threading [31], [32], [34], [35], [41], [42], [68], is such a technique. It reduces the single-thread latency of an application by using idle thread contexts in the hardware to spawn special-purpose, speculative threads called helper threads. Helper threads contain computation extracted from the main thread and consume the latency of execution on its behalf. They encounter cache misses and branch mispredictions ahead of the main thread, and act as execution-driven prefetchers or branch predictors for the main application, thereby improving its latency significantly.
Their benefit accrues from the fact that they are tailored to the specific application they are extracted from, and therefore orchestrate the hardware precisely to suit its needs.

However, helper threads can be tricky to implement efficiently. Although the original ideas appeared over two decades ago, we are not aware of any commercial processors with hardware support for helper threads today. Note that industry-strength compiler support is available for code generation of helper threads on multicore CPUs [1], [3]. However, the corresponding hardware support for generating low-overhead micro/nano threads (e.g., as in [12]) is absent. Hence, the dynamic thread spawn overhead is still significant in current operating systems. In the absence of such specialized hardware support, helper threads have two disadvantages today: (1) the need for spare thread contexts; and (2) the difficulty of synchronizing and matching the rates of the main application and the helper threads. In this work, we overcome both these limitations with an inline prefetching technique inspired by software pipelining [37], [56], yet our method retains an important benefit of helper threads: it works without access or modification to the application source code. This makes our technique attractive to cloud service providers who run third-party applications at scale.
Load instructions in a program can fall into three categories: (a) constant address, (b) striding, and (c) irregular. Constant address loads are loads whose virtual address does not change over multiple dynamic instances of the load (e.g., global variables and stack accesses). Striding loads are those with successive virtual addresses following an arithmetic progression (e.g., array accesses).
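To make the three categories concrete, here is a hypothetical C++ fragment of our own (the function and variable names are not from the paper) in which a single loop issues one load of each kind:

```cpp
#include <cstddef>

long g_counter = 0;  // (a) constant-address: &g_counter is the same on every iteration

// Hypothetical loop exhibiting all three load categories.
long sum_all(const long* a, const long* idx, std::size_t n) {
    long s = 0;
    for (std::size_t i = 0; i < n; ++i) {
        s += g_counter;   // (a) constant address load
        s += a[i];        // (b) striding load: addresses form an arithmetic progression
        s += a[idx[i]];   // (c) irregular load: the address depends on loaded data
    }
    return s;
}

// Small deterministic demo: 10 (strided) + 10 (irregular) + 0 = 20.
long demo() {
    static const long a[4]   = {1, 2, 3, 4};
    static const long idx[4] = {3, 0, 2, 1};
    return sum_all(a, idx, 4);
}
```

Hardware stride prefetchers handle (a) and (b) well; the third load is the kind this paper targets.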
Irregular loads are those which do not fall into either of the above two categories (e.g., indirect and pointer references). Furthermore, loads that frequently miss in the cache are said to be delinquent. While current hardware mechanisms are effective at prefetching regular address patterns [22], delinquent irregular loads (DILs) remain a challenge.

For a set of 165 traces from commercial applications identified by our framework as containing prefetchable DILs (see Section 4 for a formal definition), Figure 1 shows the fraction of CPU cycles spent stalled waiting for data at the time of retirement of the specific DILs.

Fig. 1: Fraction of CPU cycles spent waiting for data by specific memory-bound, prefetchable delinquent irregular loads (DILs).

The traces are from several client applications (productivity, games, content creation), server applications (cloud, database, enterprise, HPC) and CPU benchmarks simulated on a cycle-accurate simulator modeling the Intel Skylake [18] microarchitecture. On average, each trace has about three prefetchable DILs that are memory-bound. Should all that stall time be reduced to zero, the potential geometric mean speedup possible in these traces is 15%, which is significant headroom. However, these opportunities were identified from a universe of over 2000 traces, and building a prefetcher in hardware for such a narrow focus is not a profitable microarchitectural trade-off since such a prefetcher would be unused most of the time. On the other hand, large silicon and software companies currently employ significant software resources for manually optimizing select applications (e.g., Figure 5a is from a search engine). The scale of these high-value applications justifies the extra effort spent in optimizing them [9], [65].
As long as these applications fall into the right side of Figure 1, targeting their DILs through software is a better strategy than implementing a specialized prefetcher in hardware. Hence, we propose an inline software prefetching technique that selectively targets DILs that are memory-bound for prefetching.
[Figure 2 block diagram: load miss performance counters feed a Profiler (address delta profiling; delinquent memory-bound loads filter), which passes candidate regions to an Optimizer (dataflow analysis; cycle enumeration; prefetchable load identification), which emits generated code as custom prefetchers.]
Fig. 2: An overview of our customized prefetching approach, which requires no access to program source code.

Figure 2 summarizes our method. Prefetchers (either in hardware or in software) must meet accuracy, timeliness, and bandwidth criteria; specifically, an issued prefetch must be to the correct address just ahead of the actual demand, and must not throttle demand loads by consuming too much memory bandwidth. When these criteria are not met, the cache is polluted and performance suffers. Software prefetchers face an additional challenge of unintended interactions with hardware prefetchers [38]. We expressly designed our software prefetcher to avoid these complications by placing an emphasis on the selection of the load instructions that we prefetch: we only target loads that are extremely difficult for the hardware to prefetch, namely, memory-bound DILs. Through detailed profiling and dynamic dataflow analysis [36], our method identifies candidate memory-bound DILs that are part of inner loops and are likely to benefit from prefetching with minimal software prefetcher complexity. We call such DILs prefetchable and generate customized prefetching code for each. Similar to helper threads [34], [68], the prefetching code is extracted from the dataflow of the application itself. However, unlike helper threads (which require either spare thread contexts or special hardware support to run), our method inserts the prefetcher code into the application machine code directly. We find that the customized prefetching code sequence is usually small and its implementation overhead is negligible compared to the significant performance gain we achieve with improved prefetching.
This paper makes the following contributions:

• We overcome the limitations of helper threads with an inline prefetching scheme that does not require spare thread contexts or special hardware support;

• We eliminate the need for thread synchronization in helper prefetching by employing a statically-controlled prefetch distance and a prefetcher generation process inspired by software pipelining;

• For a set of irregular memory workloads, we demonstrate up to 2X end-to-end execution time speedup on current high-end hardware and up to 83% gain over helper threads.
2 RELATED WORK
A full discussion of the literature on prefetching is beyond the scope of this paper; the interested reader is referred to Falsafi and Wenisch [22] and Lee et al. [38] for thorough treatments of the hardware and software approaches to prefetching and the various challenges involved. Limiting our focus to irregular load prefetching, prior work can be classified into three main categories:
Microarchitectural techniques use on-chip storage to record patterns in the addresses of the irregular load and predict a future address to prefetch if the current address is from one of the recorded patterns [27], [30], [47], [49], [54], [57], [70], [72]. These approaches require large on-chip storage, the cost of which continues to preclude their commercial viability. The Indirect Memory Prefetcher (IMP) [72] is an exception: it uses very little on-chip storage by targeting specific indirect memory patterns of the form a[b[i]], where the array a is addressed by a striding feeder load b[i]. At runtime, IMP records in a hardware table the relationship between the striding load and the irregular load address and uses this to predict future addresses. The goals of our present work are to minimize prefetcher implementation complexity, as well as improve performance on current hardware. Hence, we choose the software implementation route. Moreover, as described in Section 3, our dataflow analysis framework is generic enough to handle more complex patterns such as a[f(b[i])] where f is any arbitrary function.

Computation-based techniques typically execute program instructions ahead of time to prefetch delinquent loads. While computation-based prefetching [25], [34], [48], [55] can be accurate, large runahead buffers or spare thread contexts for running helper threads are resource-intensive, especially considering their energy costs. Since regular loads are the overwhelming majority in general purpose applications, dedicating special hardware resources to handle comparatively rare events such as irregular loads may not represent a good microarchitectural trade-off. In contrast, a software implementation is more flexible and can be invoked to incur the cost only when the benefit is known to be greater. There are other limitations to computation-based methods beyond just the cost of implementation.
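The two access shapes can be illustrated with a sketch of our own (names are ours): the a[b[i]] pattern that IMP's table can track, versus a[f(b[i])] where the index passes through an arbitrary function, here a toy multiplicative hash:

```cpp
#include <cstddef>
#include <cstdint>

// a[b[i]]: the irregular load is fed by a striding load b[i] -- the exact
// pattern IMP's hardware table can track.
long gather_simple(const long* a, const int* b, std::size_t n) {
    long s = 0;
    for (std::size_t i = 0; i < n; ++i)
        s += a[b[i]];
    return s;
}

// a[f(b[i])]: the index passes through an arbitrary function f, which defeats
// pattern matching in hardware but is still a simple dataflow chain from the
// feeder load b[i] to the irregular load.
static std::size_t f(int x, std::size_t buckets) {
    return (static_cast<std::uint64_t>(x) * 2654435761ull) % buckets;
}

long gather_hashed(const long* a, const int* b, std::size_t n, std::size_t buckets) {
    long s = 0;
    for (std::size_t i = 0; i < n; ++i)
        s += a[f(b[i], buckets)];
    return s;
}

// Deterministic demo of the simple pattern: a[2] + a[0] + a[1] = 60.
long demo_simple() {
    static const long a[3] = {10, 20, 30};
    static const int  b[3] = {2, 0, 1};
    return gather_simple(a, b, 3);
}
```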
For example, runahead [48] is a technique that ignores a branch misprediction and continues execution to extract prefetching benefit from control-independent instructions. However, it is not designed to handle dependent misses (i.e., misses whose addresses are data-dependent on previous misses). Our approach handles dependent misses by prefetching the entire dependency chain through to the missing leaf instruction (see Section 3).

Helper threads [12], [13], [15], [16], [20], [21], [31], [32], [34], [35], [41], [42], [43], [59], [67], [68], [75] extract the backward slice of a delinquent load and run it on a spare thread context. When the latency of the backward slice is less than that of the original loop, the helper thread runs ahead of the main thread and prefetches memory accessed by the main thread into the cache. This technique has the advantage of being flexible enough to be implemented in hardware [12], [13], [15], [16], [20], [21], [23], [43], [59], [67], [75] or software [31], [32], [34], [35], [41], [42], [68]. It can also work in the absence of high-level source code and has been demonstrated in a compiler [35], a binary tool [41], and a dynamic optimizer [42]. However, all prior work in this area has either required spare thread contexts or special hardware/firmware support. Virtual Multithreading (VMT) [68] overcomes the need for spare thread contexts by partitioning the registers available to the compiler between the main and the helper computation. However, it still requires special yield instructions to orchestrate the transfer of control between the virtual threads and corresponding modifications to the processor firmware.
In our work, by choosing an inline implementation, we (1) avoid the need for any extra hardware or firmware support; (2) sidestep thread spawning overheads and synchronization bugs, since there are no threads to run; and (3) make straightforward the rate matching between the main computation and the prefetcher by statically setting the prefetch distance.

Similar to our approach, recent work [8] has proposed inserting prefetch hints based upon binary analysis. However, due to the absence of control flow analysis, it requires specialized hardware support in the form of speculative loads. Moreover, the benefits demonstrated are on top of a simulated microarchitecture without state-of-the-art hardware prefetchers such as [63], [72]. In contrast, we show benefit on existing hardware (not requiring special hardware support) and in comparison with state-of-the-art hardware prefetchers.
High-level language software techniques. Broadly speaking, software-based prefetchers are typically concerned with inserting prefetch hints [4] into a program or modifying its data structures. Most [6], [14], [38], [45], [58] rely on access to the program's source code. For instance, Roth and Sohi [58] augment the data layout of linked data structures with a jump pointer that acts as a prefetch pointer. Others [5], [6] have demonstrated significant speedup with programmable prefetching. However, many real-world situations preclude access to program source code, e.g., while using third-party libraries or when serving third-party applications in the cloud. In such situations, the ability to improve performance in the absence of source code is attractive. This work retains such capability from its lineage in helper threads. Finally, while our prefetcher generation is inspired by software pipelining [37], [56], [61], it is not a static instruction scheduling technique. Our target is performance improvement over dynamically scheduled out-of-order processors that hold multiple iterations of loops in their instruction window. The performance improvement is exclusively due to the duplication of code that stays a constant number of iterations ahead.
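As a concrete illustration of source-level prefetch-hint insertion, here is a minimal sketch of our own (the distance of 16 is an arbitrary choice for the example; __builtin_prefetch is the GCC/Clang intrinsic that lowers to a prefetch instruction):

```cpp
#include <cstddef>

constexpr std::size_t PF_DIST = 16;   // statically chosen prefetch distance

// Indirect gather with a prefetch hint issued PF_DIST iterations ahead.
long gather(const long* a, const int* idx, std::size_t n) {
    long s = 0;
    for (std::size_t i = 0; i < n; ++i) {
        if (i + PF_DIST < n)
            __builtin_prefetch(&a[idx[i + PF_DIST]]);  // hint only; no semantic effect
        s += a[idx[i]];
    }
    return s;
}

// The hint never changes the result: 6 + 8 + 5 + 7 = 26.
long demo() {
    static const long a[4]  = {5, 6, 7, 8};
    static const int idx[4] = {1, 3, 0, 2};
    return gather(a, idx, 4);
}
```

Note that even this source-level form presupposes access to the code being modified, which is exactly the restriction this paper avoids.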
3 MOTIVATING EXAMPLE
Let us now consider an example scenario where memory-bound DILs occur frequently. Hash tables are widely used because of their algorithmic efficiency in converting expected linear and logarithmic time operations into expected constant time operations [17]. For instance, they are used to implement associative arrays in popular scripting languages such as Python and R, and in relational databases for indexing. However, the underlying hash functions that generate hash table keys are designed specifically to disrupt data locality, i.e., they are designed to enforce irregular access. Thus, when a given hash table has too many unique keys to be held in on-chip caches, loading hash table entries can become a performance bottleneck.

typedef unsigned long UINT64;
typedef unordered_map
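A minimal sketch of such a histogram kernel, assuming the type names follow the visible typedefs above (the loop body is our reconstruction for illustration, not the paper's exact Listing 1):

```cpp
#include <unordered_map>
#include <vector>
#include <cstddef>

typedef unsigned long UINT64;
typedef std::unordered_map<UINT64, UINT64> HISTOGRAM;

// Hot loop: one hash-table lookup/insert per element. Once neither the input
// array nor the table fits in cache, the bucket-entry load becomes a
// memory-bound DIL.
void histogram(const std::vector<UINT64>& input, HISTOGRAM& hist) {
    for (std::size_t i = 0; i < input.size(); ++i)
        ++hist[input[i]];   // irregular: the bucket address is hash-dependent
}
```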
In the previous section, we explained the problem of memory-bound DILs through a hash table example and outlined the challenges in implementing a prefetcher with helper threads. Here, we will outline our approach to a solution, with a reminder that we want to create a prefetcher implementation without threads.

Observing the backward slice shown in Figure 3a, we see that the one cycle in the graph is comprised of a single instruction, i.e., the stride address increment, and that it is the only loop-carried dependence in this backward slice. Note that a cycle in the backward slice captures the essential relationship between the addresses produced by a DIL in successive iterations. If the instructions in the cycle can be executed efficiently by the hardware, then it becomes possible to overcome the bottlenecks outside the cycle through software prefetching. Conversely, if the cycle cannot be executed efficiently by the hardware due to true data dependencies, then the performance of such a DIL cannot be improved with software prefetching.

Specifically, if the backward slice of a DIL has no cycles with any irregular memory operations, then such a cycle can be executed efficiently by the hardware. Such a cycle can be run multiple iterations ahead of the main computation by a software prefetcher, and we describe DILs with such a backward slice as runnable. On the other hand, if the cycles in the backward slice have delinquent irregular memory operations, then running a few iterations ahead gives no advantage; the performance bottleneck would simply shift from the main computation to the prefetcher computation instead. This is the classic situation of pointer chasing and we refer to such DILs as chasing DILs.
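The distinction can be made concrete with two toy loops of our own (not taken from the paper's workloads):

```cpp
#include <cstddef>

// Runnable DIL: the only loop-carried cycle is the regular increment of i;
// the irregular load a[idx[i]] hangs off that cycle, so a duplicated slice
// can run k iterations ahead and prefetch its addresses.
long runnable(const long* a, const int* idx, std::size_t n) {
    long s = 0;
    for (std::size_t i = 0; i < n; ++i)
        s += a[idx[i]];
    return s;
}

struct Node { long val; Node* next; };

// Chasing DIL: the loop-carried cycle *is* the irregular load p = p->next,
// so a prefetcher running ahead merely inherits the same serialized misses.
long chasing(const Node* p) {
    long s = 0;
    for (; p != nullptr; p = p->next)
        s += p->val;
    return s;
}

// Deterministic demos: 30 + 30 + 10 = 70, and 1 + 2 + 3 = 6.
long demo_runnable() {
    static const long a[3]  = {10, 20, 30};
    static const int idx[3] = {2, 2, 0};
    return runnable(a, idx, 3);
}
long demo_chasing() {
    static Node c{3, nullptr}, b{2, &c}, a{1, &b};
    return chasing(&a);
}
```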
Short of moving the whole cycle of chasing computation closer to memory (through techniques such as processing in memory), not much can be done to improve such loads.

To explicitly contrast runnable and chasing DILs, we provide respective examples extracted from real applications in Figure 5.

(a) Disassembly. (b) Dataflow.

[Figure 3 legend: constant address loads; striding loads; other irregular loads; critical DIL; dataflow leading to the DIL (backward slice).]
Fig. 3: Disassembly and dataflow of the hot loop in Listing 1.
Fig. 4: Performance results from the helper thread implementation. The two plots on the left are from when the main and helper threads are restricted to SMT contexts in the same core. The two plots on the right are obtained by allowing the threads to be scheduled on any of the available cores.

The coloring scheme remains the same as in Figure 3. The backward slice of the runnable DIL shown in Figure 5a has three total cycles, but none have irregular memory operations. In contrast, the backward slice of the chasing DIL shown in Figure 5b has two cycles, one of which has an irregular memory operation (yellow).

Through dataflow analysis, we can determine if a DIL is runnable. However, merely being runnable does not guarantee that a DIL is also prefetchable. We must also examine the control flow within the loop. If the backward slice of a DIL varies along the different control flow paths through the loop, then the backward slice is control dependent on the branches within the loop. A popular example of such a situation occurs in an array-based implementation of a binary search tree. If the current search node is at index x, the next node to be searched can either be the left child (at index 2x + 1) or the right child (at index 2x + 2). If the prefetcher runs k iterations ahead, then there are 2^k possible addresses to prefetch. We have the option of either prefetching all of those addresses or implementing a software-based branch predictor to select one of the addresses to prefetch. Both of these options are unrealistic and hence we deliberately exclude such situations by considering only DILs that have backward slices that remain control independent of all the branches within the loop. Finally, when a DIL is runnable as well as control independent, we call it prefetchable. These two criteria comprise our DIL screen; our software prefetcher framework only targets DILs that pass this screen for custom prefetching code generation.

Once a prefetchable DIL has been identified, inspired by software pipelining [37], [56], we take a carrot and horse approach to prefetching it. We duplicate the backward slice code and assign new registers to it. By analogy, this code is the "carrot" and the main computation is the "horse". Prior to the entry into the loop, the carrot is first extended k iterations ahead of the horse. We call this phase in the dynamic execution the head start phase. After the entry into the loop, the carrot locks steps with the horse and stays a constant k iterations ahead. We call this phase in the dynamic execution the stay ahead phase. During the last k iterations of the loop, the carrot ceases to stay ahead and merges with the horse. We call this phase of dynamic execution the join phase. Finally, since the carrot overwrites the architectural registers, we also need to save them onto the stack at loop entry and restore them at loop exit.

This process is more formally described in Figure 6. The figure contrasts the dynamic instruction streams and the memory addresses accessed before and after the insertion of the software prefetcher code. At the top, each iteration in the original instruction stream has a DIL (marked DIL_i) that demands a particular memory address.

(a) A Runnable DIL.
(b) A Chasing DIL.

Fig. 5: Examples of a Runnable and a Chasing DIL. The runnable DIL has three cycles, but no irregular memory operations are part of these cycles. In contrast, the chasing DIL has two cycles and one of them has an irregular memory operation (yellow).

[Figure 6 diagram: the original instruction stream of DILs (DIL_1 ... DIL_n) versus the stream with custom prefetcher code (CP_i) interleaved, annotated with the save, head start, stay ahead, join, and restore phases across iterations 1, 2, ... k, k+1, ... n-k, ... n.]
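The head start, stay ahead, and join phases can be sketched at C++ level as follows (a simplification of our own with an array-index backward slice; the paper operates on machine code, not source, and K = 8 matches the later assembly example):

```cpp
#include <cstddef>

constexpr std::size_t K = 8;   // statically chosen prefetch distance

long sum_with_carrot(const long* a, const int* idx, std::size_t n) {
    // Head start: the duplicated slice ("carrot") begins K iterations ahead
    // of the main computation ("horse"), clamped for short loops.
    std::size_t pf = (K < n) ? K : (n ? n - 1 : 0);

    long s = 0;
    for (std::size_t i = 0; i < n; ++i) {
        __builtin_prefetch(&a[idx[pf]]);  // carrot: prefetch the future DIL address
        if (pf + 1 < n) ++pf;             // stay ahead: advance in lock step;
                                          // join: once pf reaches n-1 it stops
        s += a[idx[i]];                   // horse: the original delinquent load
    }
    return s;
}

// Prefetching is semantically transparent: 4 + 2 + 1 + 3 = 10.
long demo_carrot() {
    static const long a[4]  = {1, 2, 3, 4};
    static const int idx[4] = {3, 1, 0, 2};
    return sum_with_carrot(a, idx, 4);
}
```

At source level there are no registers to save and restore; that step only arises because the actual insertion happens in machine code, where the carrot needs registers of its own.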
Fig. 6: Overview of the phases in our prefetching scheme. DILs (marked DIL_i) at each loop iteration demand particular memory addresses. We insert customized prefetching code (yellow) that runs k iterations ahead to prefetch those addresses and mitigate the delinquency.

At the bottom of the figure, customized prefetching code (yellow) is inserted into the instruction stream. These are given a head start to run k iterations ahead such that the addresses they prefetch mitigate all the DILs within the stay ahead and join phases. Please note that although the carrot and horse approach sounds similar in principle to software pipelining, it is not an instruction scheduling technique and the speedups are exclusively because of the duplication of code that stays a constant number of iterations ahead.

With this overall picture in mind, we provide the details of our method next. The first step is the identification of DRAM-bound load instructions. For this purpose, we employ detailed profiling and dataflow analysis of the application of interest. Our analysis infrastructure uses a pintool [44] to generate the basic block vector profiles of the application at a 10M instruction granularity and the SimPoint [62] methodology to identify representative regions for microarchitectural simulation. We implement K-means clustering and augment it with silhouette analysis [60] to ensure clusters of good quality. We then use PinPlay [53] to generate two traces for each SimPoint, a short trace for functional simulation and dataflow analysis and a long trace for cycle-accurate microarchitectural analysis. The short functional simulation traces are 10M instructions long.
The long microarchitectural traces are 310M instructions long in order to accommodate in simulation a cache warmup period of 295M instructions, a microarchitectural warmup period of 5M instructions, and a detailed cycle-accurate simulation of 10M instructions.

Next, we perform cycle-accurate simulations of a microarchitecture resembling Intel's Skylake [18] CPU on an in-house x86_64 performance simulator. The cycle simulations produce a list of DRAM-bound load instructions, defined as those with an average CPI higher than the latency of the last level cache. This output list is sorted by the fraction of the total L1 data cache misses produced by each load instruction. We then select the delinquent load instructions covering the top 99% of all the L1 data cache misses for further dataflow analysis. It is worth noting that we chose this route of implementation through trace-level, cycle-accurate microarchitectural simulation, but there are other ways to identify DRAM-bound load instructions, e.g., with assistance from hardware performance counters [19], functional cache simulation [28], profiling DRAM accesses [10], [66] or even statically [52].

The next step in our analysis is to identify the irregular loads from the list of DRAM-bound loads. We achieve this through the calculation of address deltas, defined as the numerical difference between the addresses produced by successive executions of a load instruction. We compute the address delta histograms for all the DRAM-bound loads in the short traces. An n-dimensional regular array accessed inside a loop can produce n different address deltas. Hence, in order to filter out high-dimensional regular arrays common in numerical code, we choose a threshold of 10 deltas, i.e., we select only those load instructions with at least 10 distinct address deltas covering the top 90% of the executions.
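The address-delta screen just described can be sketched as follows (the thresholds of 10 deltas and 90% coverage are from the text; the function shape and names are our own):

```cpp
#include <algorithm>
#include <map>
#include <vector>
#include <cstdint>
#include <cstddef>

// Flag a load as irregular when at least 10 distinct address deltas are
// needed to cover the top 90% of its dynamic executions.
bool is_irregular(const std::vector<std::uint64_t>& addrs) {
    if (addrs.size() < 2) return false;

    std::map<std::int64_t, std::size_t> hist;   // delta -> occurrence count
    for (std::size_t i = 1; i < addrs.size(); ++i)
        ++hist[static_cast<std::int64_t>(addrs[i] - addrs[i - 1])];

    std::vector<std::size_t> counts;
    for (const auto& kv : hist) counts.push_back(kv.second);
    std::sort(counts.rbegin(), counts.rend());  // most frequent deltas first

    std::size_t total = addrs.size() - 1, covered = 0, needed = 0;
    for (std::size_t c : counts) {
        covered += c;
        ++needed;
        if (covered * 10 >= total * 9) break;   // reached 90% coverage
    }
    return needed >= 10;
}
```

A striding load has a single dominant delta and is filtered out immediately; a hash-dependent load needs many distinct deltas to cover its executions and passes the screen.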
This is our DIL candidate list.

We then build the dynamic control flow graph [7] using the short traces and determine the loop immediately encompassing each DIL. After that, we enumerate all the different control flow paths within the loop. For each such path, we perform dynamic dataflow analysis [36] to compute the backward slice graph and enumerate all the simple cycles in it [29]. With the information from the aforementioned address delta analysis, we find the cycles that do not involve any irregular memory operations and determine whether the DIL is runnable. When a runnable DIL has the same backward slice along all the control flow paths within its encompassing loop, we flag it as control independent and hence prefetchable. For these graph computations, we utilize the networkx [24] package in Python.

Once all the prefetchable DILs and their encompassing loops are identified, we group the DILs into loops and determine which among them inside the same loop produce addresses that are at a small constant offset from one another. We drop all such DILs from our list except the DIL with the largest average CPI (the critical DIL), since such addresses either fall within the same cache line as the critical DIL or regular hardware prefetchers will handle them properly. One of the load instructions in Figure 3 is an example of such a case. Moreover, to avoid alias analysis, we restrict ourselves to situations where the addresses of the stores in the backslice can be inferred statically. Through these successive screens, we are ultimately left with only those prefetchable DILs that are most challenging for the hardware to prefetch.

We now illustrate the generation of the customized prefetching code for the phases shown in Figure 6, using the hash table example from Figure 3. Keep in mind that we do not operate on the source code and hence begin with the loop shown in Figure 3a.
We insert the prefetcher assembly into the application's assembly directly. As a first step, we attempt to find unused architectural registers inside the loop. When there are no unused registers available, we create new local variables on the stack and select registers to spill onto them in the following order for minimal performance impact:

1) Registers only written to but never read from inside the loop (only the last write to these registers needs to be made visible outside the loop);

2) Registers only read from but never written to inside the loop (all references to these registers will be replaced by their corresponding stack loads).

For our example in Figure 3a, it turns out that registers r11, r14 and r15 are unused inside the loop. Among these, r11 is caller-saved and there is a function call inside the loop, meaning it could potentially be used inside the function call. Thus, we choose r14 and r15 as the registers to use for our carrot computation, i.e., inside the customized prefetching code. As discussed before and shown in Figure 6, the first step is to save these registers onto the stack:

    pushq %r14
    pushq %r15

Listing 2: The save phase.

Next is the head start phase, also performed at loop entry, where the prefetch computation gets a k-iteration head start. In our example, rbp is the only register written inside the cycle in the backward slice graph. Hence, we duplicate it onto r14. We also use r15 as a scratchpad to perform the loop boundary check by comparing it with the loop limit in rbx, as follows.

    .set k, 8
    .set logk, 3
    movq %rbp, %r14
    movq $k, %r15
    addq $0x1, %r15
    cmpq %rbx, %r15
    jge SKIP1
    movq $0x8, %r15
    shlq $logk, %r15
    addq %r15, %r14
SKIP1:
Listing 3: The head start phase.

The next two phases are (1) the stay ahead phase, where the prefetcher (carrot) computation stays ahead of and in lock step with the main (horse) computation, and (2) the join phase, where the prefetcher computation no longer stays ahead and ultimately merges with the main computation. Both of these phases are inserted into the loop body and are shown in Listing 4. For clarity: the inserted prefetcher code comprises the instructions that reference %r14 or %r15, the prefetcht0, and the SKIP labels; the remaining instructions are the original loop body.

START:
    movq (%r14),%r15
    movq (%rbp),%r9
    movq 0x8(%r12),%r8
    xorl %edx,%edx
    movq %r15,%rax
    divq %r8
    movq %rdx, %r15
    xorl %edx,%edx
    movq %r9,%rax
    divq %r8
    movq (%r12),%rax
    movq (%rax,%r15,8),%r15
    movq (%rax,%rdx,8),%rax
    movq %rdx,%r10
    testq %rax,%rax
    je LABEL1
    testq %r15, %r15
    je SKIP2
    movq (%r15),%r15
    prefetcht0 0x8(%r15)
SKIP2:
    movq (%rax),%rcx
    movq 0x8(%rcx),%rsi
    cmpq %rsi,%r9
    jne LABEL2
    movq %rbp,%rsi
    movq %r12,%rdi
    addq $0x1,%r13
    addq $0x8,%rbp
    addq $0x8, %r14
    movq %rbx, %r15
    subq $k, %r15
    cmpq %r15, %r13
    jl SKIP3
    movq %rbp, %r14
SKIP3:
    call 0xf60
    addq $0x1,(%rax)
    cmpq %r13,%rbx
    jne START

Listing 4: The stay ahead and join phases.

The last step is to restore the saved registers at all exit points of the loop.

    popq %r15
    popq %r14

Listing 5: The restore phase.

After the insertion of the prefetcher code, to ensure correctness, we compare the output of the optimized version to that of the unoptimized version and require that they match exactly, except for those outputs dependent on operating system behavior such as timing measurement, random number generation, signal handling, etc.

5 EXPERIMENTAL EVALUATION
Recall from Figure 1 that while DIL prefetching may not benefit all applications, some irregular applications can benefit substantially (right side of Figure 1). For instance, several high-value cloud applications fall into this category. Hence, we evaluate our proposal on a set of irregular memory workloads similar to those in the work by Ainsworth and Jones [6] (we do not use the applications from Figure 1 since we do not have access to their binaries). We study three applications from their work that are bottlenecked by DRAM-bound DILs and add two more to the evaluation, including the hash table example we discussed in Sections 3 and 4. Since our focus is on single-thread performance, we use the serial versions of the benchmarks for experimentation. We compile all benchmarks with gcc 6.1.0 using the flags -O3 -march=native on an Intel Xeon E5 server CPU. We run all the analysis tools for prefetchable DIL identification and generate the customized prefetching code on the same server as well.
We now provide a brief overview of the applications studied.
• STLHistogram is the example we discussed in Sections 3 and 4. It generates a random array of integers and computes the frequency histogram of the array using a C++ STL unordered_map. It takes the size of the array and the number of unique elements in it as arguments. The microarchitectural performance of this application suffers when neither the input array nor the frequency histogram fits inside the on-chip caches. We chose this benchmark due to the popularity of hash tables in programs and the potential for customized prefetching to improve performance. Note that since open-address hash tables are also popular, we additionally studied a policy-based implementation of STLHistogram. While the baseline performance of this new version was 7X better than the unordered_map version, the performance improvement opportunity was very similar, with a single prefetchable memory-bound DIL causing most of the stalls. Hence, we report results only for the unordered_map version.
• PageRank is an implementation of the popular web-page relevance ranking algorithm [50] using the C++ Boost Graph Library [2] (BGL). It is a graph algorithm that ranks a website based upon the ranks of the websites that link to it.
• HashJoin [11] from the University of Wisconsin implements the join operation of a relational database [64] in main memory using hash tables. The join operation is very common in Structured Query Language (SQL) queries.
• Graph500CSR is part of the Graph500 [46] benchmark suite designed to rate supercomputer systems on their data-intensive performance. It performs Breadth-First Search (BFS) on a large graph implemented using a compressed sparse rows (CSR) data structure.
• Cuckoo [73] is an application modeling packet processing in the context of Network Function Virtualization (NFV) using the cuckoo hashing algorithm [51].
We run the sequential versions of these applications on the inputs shown in Table 1 and generate traces as discussed in Section 4.1. An automatic tool analyzes the traces to produce the list of prefetchable DILs, the loops they belong to, and a list of available registers for code generation. The customized prefetching code is then generated semi-automatically with manual intervention. Specifically, our scripts generate a skeletal prefetcher code with the duplicated backward slice and a list of candidate registers. However, register fills/spills, null-pointer skips, and the handling of slices across function calls are done manually. Another automatic tool then statically rewrites the original function in the binary with a dynamic version that allocates the optimized code in the heap and calls it through a function pointer. We then run the optimized binary to ensure that its output matches the original. For performance measurement, we employ an Intel Core i9-7900X Skylake CPU with all the hardware prefetchers enabled, running at 3.3 GHz with frequency scaling disabled in the BIOS. We choose an evaluation system that is different from the one used for compilation to simulate a binary-only scenario. We run the applications five times each and record the median wall clock time before and after optimization. We also measure the dynamic instruction overhead of the optimized versions using a pintool [44]. The last column of Table 1 shows the dynamic instruction counts of the main computation in the original applications.
Benchmark            Input                             Dynamic Instr (B)
STLHistogram         100M array, 10M unique elements   7.9
PageRank [2], [50]   web-Google.txt [40]               1.1
HashJoin [11]        016M build.tbl, 256M probe.tbl    55.8
Graph500CSR [46]     -s 18 -e 10                       11.2
Cuckoo [51], [73]    8M flows                          10.2

TABLE 1: Benchmarks and inputs (Input 1).
First, we present the results of the control and dataflow analyses for the applications.
Benchmark      DILs   Prefetchable DILs   Loops   Function Name
STLHistogram   4      3                   1       gen_histo
PageRank       4      4                   2       pagerank
HashJoin       2      2                   1       realprobeCursor
Graph500CSR    6      6                   2       make_bfs_tree, verify_bfs_tree
Cuckoo         3      2                   1       rte_hash_lookup_bulk_data

TABLE 2: Results of the control and dataflow analyses.

The data in Table 2 shows that of the 19 total DILs, 17 are prefetchable. We proceed with the performance evaluation of the prefetchers for these DILs.
For the five applications, we vary the prefetch distance from two iterations to 256 iterations in powers of two. Note that we choose powers of two only for a minor convenience in code generation, since multiplication can be replaced with shifts; it is not a fundamental restriction of our approach and can easily be changed to accommodate any arbitrary lookahead. We then verify that the outputs of the optimized binaries match those of the original ones, and then measure the performance of the optimized versions. The speedup from the performance optimization is shown in Figure 7a. The corresponding dynamic instruction overhead is shown in Figure 7b. The x-axis on both figures is the prefetch distance, which is the number of iterations of lookahead available for the prefetcher. The y-axis in Figure 7a is the ratio of the median wall clock time before optimization to that after. Figure 7b plots on its y-axis the ratio of the total dynamic instructions of the baseline to that of the optimized executions. Note that the speedups reported include the dynamic instruction overhead since we measure wall clock time.

Fig. 7: Performance of the prefetching scheme (Input 1). (a) Speedup. (b) Dynamic instruction overhead.
For the applications and inputs described in Table 1, there is a significant speedup of 21-94% due to our software prefetchers. This speedup comes in spite of significant dynamic instruction overhead in some cases. Hence, this result clearly demonstrates that we are successful in accurately prefetching the critical load addresses in a manner that does not interfere with the memory bandwidth or with any hardware prefetchers.

A pattern to observe in the data is that even with only a few iterations of prefetch-distance lookahead, the performance increases significantly. In fact, except for PageRank and Cuckoo, the performance improvement is stable across the entire range of prefetch distances. This is because the loop sizes are such that only a few iterations fit in the dynamic instruction window of the CPU. Hence, even with a small lookahead, the prefetcher reaches outside the instruction window and is effective. However, the behaviors of PageRank and Cuckoo deserve further explanation.

PageRank operates on the Web-Google graph dataset [40], which has an average degree of less than six. The inner loop encompassing the prefetchable DIL iterates over all the neighbors of a graph node. Hence, the trip count of this loop is equal to the average number of a node's neighbors, i.e., its average degree. Therefore, prefetch distances longer than six skip the loop entirely and do not help much. This behavior can also be seen in the dynamic instruction overhead data in Figure 7b. A similar behavior occurs in Cuckoo as well, where the prefetchable DILs are from an inner loop with a fixed iteration count of 32. The lost opportunity due to small iteration counts is the reason for the reduction in performance with increasing lookahead.

Note that the performance gains for STLHistogram and HashJoin are much higher than those for the remaining three. In the former two, the critical DIL is fed by a strided load after passing through a hash function and multiple indirections.
However, in the latter three, the strided load feeds the DIL directly through fewer indirections (and a hash function in Cuckoo). Thus, as discussed in Sections 3 and 4, the bottleneck of the chain of dependent cache misses is much larger for the former applications than the latter. Consequently, the performance boost obtained by mitigating them is also higher.

5.2.2.1 Impact of Inputs: Next, we select a set of larger inputs for our applications and run the optimized binaries on this set to study the sensitivity to different application inputs. Table 3 lists the new inputs used for this experiment.
Benchmark      Input                             Dynamic Instr (B)
STLHistogram   200M array, 10M unique elements   12.7
PageRank       cit-Patents.txt [39]              4.2
HashJoin       032M build.tbl, 512M probe.tbl    148.2
Graph500CSR    -s 21 -e 10                       90.7
Cuckoo         16M flows                         20.5

TABLE 3: Alternative inputs for the optimized benchmarks (Input 2).

Figure 8 displays the speedup and the dynamic instruction overhead for the optimized binaries running on these new inputs. We can see that the speedup has improved for STLHistogram, stayed about the same for HashJoin and Cuckoo, and decreased for PageRank and Graph500CSR. Overall, the speedups range from 10% to 100% and are still significant over the baselines.

Fig. 8: Prefetcher performance on different input data (Input 2). (a) Speedup. (b) Dynamic instruction overhead.

For PageRank, the cit-Patents dataset [39] has an average degree of about 4, lower than that of the previous Web-Google dataset. Thus, as discussed earlier, the drop in its speedup can be attributed to the reduced trip count of its inner loop. As for Graph500CSR, the new input has a higher number of vertices but the same average degree as before, and the performance contribution of the DILs is lower than before. Hence, the corresponding speedup from prefetching them is also lower.

5.2.2.2 Impact of Microarchitecture: The results shown so far were for a single microarchitecture. To study the impact of a different microarchitecture, we generate traces from the unoptimized and optimized binaries and perform cycle-accurate simulations on them for an aggressive hypothetical microarchitecture that is 2X wider and 3.5X deeper than Skylake. We also model two aggressive hardware prefetchers similar to VLDP [63] and IMP [72], since they were published after the release of the Skylake microarchitecture. Figure 9 shows the result of the experiment. Unlike Skylake, the dynamic instruction window of the hypothetical microarchitecture can hold many more iterations of the loops.
Hence, short prefetch distances do not reach beyond the instruction window. This is why the speedups are lower for shorter lookaheads (for the benchmarks without small loop iteration counts). However, once the prefetch distances are sufficient to look beyond the instruction window, the speedups stabilize. The extra latency hiding offered by the 3X increase in out-of-order depth causes the DILs from PageRank to no longer be performance bottlenecks. Hence, the instruction overhead for that benchmark shows up as a slowdown in the chart. Nevertheless, the speedups continue to be significant overall (the bar for STLHistogram is missing for the prefetch distance of 128 due to a simulation failure).

Fig. 9: Prefetcher performance on a hypothetical microarchitecture that is 2X wider and 3.5X deeper than Skylake and includes aggressive prefetchers similar to VLDP [63] and IMP [72].

The stability of the speedups across prefetch distances beyond a particular threshold is helpful in the case of variable DRAM latencies. Setting the lookahead for the worst-case memory latency can provide speedups that are robust to the variability. Moreover, the fact that the speedups remain significant even under contemporary aggressive hardware prefetchers emphasizes that our approach is complementary to the hardware and minimizes interference.

5.2.2.3 Comparison with Helper Threads: We now compare the inline prefetcher to traditional helper threads. To provide the techniques with equal hardware, we restrict the helper thread implementations to one additional SMT context from the same core as the main thread. We also select the best tuning parameters (prefetch distance for the inline prefetcher and launch trigger/frequency for the helper threads) for both schemes. Figure 10 shows the results of the experiment.

Our inline prefetcher outperforms helper threads due to the latter's thread-spawning overhead. Dropping the outlier (Cuckoo), the speedups range from 13% to 83%, which is significant. For Cuckoo, the number of thread spawns is prohibitive for helper threading to be competitive. From the results of these experiments, we conclude that the proposed prefetcher scheme is accurate in targeting the critical load instructions and improves the single-thread performance of the targeted applications significantly. It does so without the requirements of traditional helper threading, such as idle thread contexts and special support from hardware or firmware.
As a binary modification technique, debuggability can be affected by the optimization. Hence, it is a good idea to restrict the optimization only to performance-critical code.

Fig. 10: Percent improvement of the inline prefetcher over helper threads.

As a prefetching scheme running on the CPU, we drop all pointer-chasing loads from the purview of our optimization. Such a restriction is not essential. The backward slices and cycles with chasing loads are ideal for offload into Processing-In-Memory (PIM). Future work could explore means of implementing such offloading. Also, we have restricted ourselves to a software implementation on existing hardware, which is not mandatory. The profile-based, offline dataflow analysis could advise hardware-software co-design, and the prefetchers could be implemented in custom hardware instead. With the advent of Field-Programmable Gate Arrays (FPGAs), custom hardware prefetchers closely coupled with a processor's pipeline are another potential direction of investigation.
CONCLUSION
In this paper, we have described an inline software prefetcher for DRAM-bound Delinquent Irregular Loads (DILs). To avoid interfering with the hardware prefetchers for regular loads and to keep the bandwidth impact and cache pollution to a minimum, we have designed the scheme to be highly selective, targeting only the DILs most difficult for the hardware to prefetch. In spite of being selective, our approach has significant potential for performance enhancement, as demonstrated by four applications from different domains: a C++ hash table implementation, the PageRank website ranking algorithm, a database join algorithm, and the Graph500 breadth-first search of a large graph. Across all inputs to the test applications, the speedup due to our inline prefetchers ranged from 10% to 100% on a high-end Intel Skylake system. Our approach performs better than a traditional implementation of helper threads due to the latter's thread-spawning overhead. It does so while still not requiring separate thread contexts or special hardware/firmware support. It makes the implementation and debugging of the helper easier, since it avoids explicit synchronization and stays a constant number of iterations ahead of the main computation. As a software approach that does not require high-level source code, it can be attractive for third-party cloud applications.

REFERENCES
[2] The Boost Graph Library: User Guide and Reference Manual. Boston, MA, USA: Addison-Wesley Longman Publishing Co., Inc., 2002.
[3] "Intel C++ Compiler Professional Edition 11.1 for Linux Installation Guide and Release Notes," https://software.intel.com/content/dam/develop/external/us/en/documents/release-notesc-l-en-779487.pdf, Sep. 2009, [Online; accessed 18-June-2020].
[4] A.-R. Adl-Tabatabai, R. L. Hudson, M. J. Serrano, and S. Subramoney, "Prefetch injection based on hardware monitoring and object metadata," in Proceedings of the ACM SIGPLAN 2004 Conference on Programming Language Design and Implementation, 2004.
[5] S. Ainsworth and T. M. Jones, "Software prefetching for indirect memory accesses," in Proceedings of the 2017 International Symposium on Code Generation and Optimization, 2017.
[6] ——, "An event-triggered programmable prefetcher for irregular workloads," in Proceedings of the Twenty-Third International Conference on Architectural Support for Programming Languages and Operating Systems, 2018.
[7] F. E. Allen, "Control flow analysis," in Proceedings of a Symposium on Compiler Optimization, 1970.
[8] G. Ayers, H. Litz, C. Kozyrakis, and P. Ranganathan, "Classifying memory access patterns for prefetching," in Proceedings of the Twenty-Fifth International Conference on Architectural Support for Programming Languages and Operating Systems, 2020.
[9] G. Ayers, N. P. Nagendra, D. I. August, H. K. Cho, S. Kanev, C. Kozyrakis, T. Krishnamurthy, H. Litz, T. Moseley, and P. Ranganathan, "AsmDB: Understanding and mitigating front-end stalls in warehouse-scale computers," in Proceedings of the 46th International Symposium on Computer Architecture, 2019.
[10] Y. Bao, M. Chen, Y. Ruan, L. Liu, J. Fan, Q. Yuan, B. Song, and J. Xu, "HMTT: A platform independent full-system memory trace monitoring system," SIGMETRICS Perform. Eval. Rev., vol. 36, no. 1, Jun. 2008.
[11] S. Blanas, Y. Li, and J. M. Patel, "Design and evaluation of main memory hash join algorithms for multi-core CPUs," in Proceedings of the 2011 ACM SIGMOD International Conference on Management of Data, ∼jignesh/multijoin.tar.bz2.
[12] R. S. Chappell, J. Stark, S. P. Kim, S. K. Reinhardt, and Y. N. Patt, "Simultaneous subordinate microthreading (SSMT)," in Proceedings of the 26th Annual International Symposium on Computer Architecture, 1999.
[13] R. S. Chappell, F. Tseng, A. Yoaz, and Y. N. Patt, "Microarchitectural support for precomputation microthreads," in Proceedings of the 35th Annual ACM/IEEE International Symposium on Microarchitecture, 2002.
[14] T. M. Chilimbi and M. Hirzel, "Dynamic hot data stream prefetching for general-purpose programs," in Proceedings of the ACM SIGPLAN 2002 Conference on Programming Language Design and Implementation, 2002.
[15] J. D. Collins, D. M. Tullsen, H. Wang, and J. P. Shen, "Dynamic speculative precomputation," in Proceedings of the 34th Annual ACM/IEEE International Symposium on Microarchitecture, 2001.
[16] J. D. Collins, H. Wang, D. M. Tullsen, C. Hughes, Y.-F. Lee, D. Lavery, and J. P. Shen, "Speculative precomputation: Long-range prefetching of delinquent loads," in Proceedings of the 28th Annual International Symposium on Computer Architecture, 2001.
[17] T. H. Cormen, C. E. Leiserson, R. L. Rivest, and C. Stein, Introduction to Algorithms.
Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, 2016.
[20] M. Dubois, "Fighting the memory wall with assisted execution," in Proceedings of the 1st Conference on Computing Frontiers, 2004.
[21] M. Dubois and Y. Song, "Assisted execution," University of Southern California CENG Technical Report, vol. 98, no. 25, 1998.
[22] B. Falsafi and T. F. Wenisch, A Primer on Hardware Prefetching. Morgan & Claypool Publishers, 2014.
[23] A. Garg and M. C. Huang, "A performance-correctness explicitly-decoupled architecture," in Proceedings of the 41st Annual IEEE/ACM International Symposium on Microarchitecture, 2008.
[24] A. A. Hagberg, D. A. Schult, and P. J. Swart, "Exploring network structure, dynamics, and function using NetworkX," in Proceedings of the 7th Python in Science Conference, 2008.
[25] M. Hashemi, O. Mutlu, and Y. N. Patt, "Continuous runahead: Transparent hardware acceleration for memory intensive workloads," in The 49th Annual IEEE/ACM International Symposium on Microarchitecture, 2016.
[26] U. Holzle, "Brawny cores still beat wimpy cores, most of the time," IEEE Micro, 2010.
[27] A. Jain and C. Lin, "Linearizing irregular memory accesses for improved correlated prefetching," in Proceedings of the 46th Annual IEEE/ACM International Symposium on Microarchitecture.
SIAM Journal on Computing, vol. 4, no. 1, 1975.
[30] D. Joseph and D. Grunwald, "Prefetching using Markov predictors," in Proceedings of the 24th Annual International Symposium on Computer Architecture, 1997.
[31] C. Jung, D. Lim, J. Lee, and Y. Solihin, "Helper thread prefetching for loosely-coupled multiprocessor systems," in Proceedings of the 20th IEEE International Parallel & Distributed Processing Symposium, 2006.
[32] M. Kamruzzaman, S. Swanson, and D. M. Tullsen, "Inter-core prefetching for multicore processors using migrating helper threads," in Proceedings of the Sixteenth International Conference on Architectural Support for Programming Languages and Operating Systems, 2011.
[33] S. Kanev, J. P. Darago, K. Hazelwood, P. Ranganathan, T. Moseley, G.-Y. Wei, and D. Brooks, "Profiling a warehouse-scale computer," in Proceedings of the 42nd Annual International Symposium on Computer Architecture, 2015.
[34] D. Kim, S. S.-W. Liao, P. H. Wang, J. del Cuvillo, X. Tian, X. Zou, H. Wang, D. Yeung, M. Girkar, and J. P. Shen, "Physical experimentation with prefetching helper threads on Intel's hyper-threaded processors," in Proceedings of the International Symposium on Code Generation and Optimization: Feedback-directed and Runtime Optimization, 2004.
[35] D. Kim and D. Yeung, "Design and evaluation of compiler algorithms for pre-execution," in Proceedings of the 10th International Conference on Architectural Support for Programming Languages and Operating Systems, 2002.
[36] B. Korel and J. Laski, "Dynamic program slicing," Information Processing Letters, vol. 29, no. 3, Oct. 1988.
[37] M. Lam, "Software pipelining: An effective scheduling technique for VLIW machines," in Proceedings of the ACM SIGPLAN 1988 Conference on Programming Language Design and Implementation, 1988.
[38] J. Lee, H. Kim, and R. Vuduc, "When prefetching works, when it doesn't, and why," ACM Transactions on Architecture and Code Optimization, vol. 9, no. 1, Mar. 2012.
[39] J. Leskovec, J. Kleinberg, and C. Faloutsos, "Graphs over time: Densification laws, shrinking diameters and possible explanations," in Proceedings of the Eleventh ACM SIGKDD International Conference on Knowledge Discovery in Data Mining, 2005, http://snap.stanford.edu/data/cit-Patents.html.
[40] J. Leskovec, K. J. Lang, A. Dasgupta, and M. W. Mahoney, "Community structure in large networks: Natural cluster sizes and the absence of large well-defined clusters," Internet Mathematics, vol. 6, no. 1, Jan. 2009, http://snap.stanford.edu/data/web-Google.html.
[41] S. S. Liao, P. H. Wang, H. Wang, G. Hoflehner, D. Lavery, and J. P. Shen, "Post-pass binary adaptation for software-based speculative precomputation," in Proceedings of the ACM SIGPLAN 2002 Conference on Programming Language Design and Implementation, 2002.
[42] J. Lu, A. Das, W.-C. Hsu, K. Nguyen, and S. G. Abraham, "Dynamic helper threaded prefetching on the Sun UltraSPARC CMP processor," in Proceedings of the 38th Annual IEEE/ACM International Symposium on Microarchitecture, 2005.
[43] C.-K. Luk, "Tolerating memory latency through software-controlled pre-execution in simultaneous multithreading processors," in Proceedings of the 28th Annual International Symposium on Computer Architecture, 2001.
[44] C.-K. Luk, R. Cohn, R. Muth, H. Patil, A. Klauser, G. Lowney, S. Wallace, V. J. Reddi, and K. Hazelwood, "Pin: Building customized program analysis tools with dynamic instrumentation," in Proceedings of the 2005 ACM SIGPLAN Conference on Programming Language Design and Implementation, 2005.
[45] C.-K. Luk and T. C. Mowry, "Compiler-based prefetching for recursive data structures," in Proceedings of the Seventh International Conference on Architectural Support for Programming Languages and Operating Systems, 1996.
[46] R. C. Murphy, K. B. Wheeler, B. W. Barrett, and J. A. Ang, "Introducing the Graph 500," Cray User's Group, May 2010.
[47] O. Mutlu, H. Kim, and Y. N. Patt, "Address-value delta (AVD) prediction: Increasing the effectiveness of runahead execution by exploiting regular memory allocation patterns," 2005.
[48] O. Mutlu, H. Kim, and Y. N. Patt, "Techniques for efficient processing in runahead execution engines," in Proceedings of the 32nd Annual International Symposium on Computer Architecture, 2005.
[49] K. J. Nesbit and J. E. Smith, "Data cache prefetching using a global history buffer," in Proceedings of the 10th International Symposium on High Performance Computer Architecture, 2004.
[50] L. Page, S. Brin, R. Motwani, and T. Winograd, "The PageRank citation ranking: Bringing order to the web," Tech. Rep. 1999-66, Nov. 1999.
[51] R. Pagh and F. F. Rodler, "Cuckoo hashing," J. Algorithms, vol. 51, no. 2, May 2004.
[52] V.-M. Panait, A. Sasturkar, and W.-F. Wong, "Static identification of delinquent loads," in Proceedings of the International Symposium on Code Generation and Optimization: Feedback-directed and Runtime Optimization, 2004.
[53] H. Patil, C. Pereira, M. Stallcup, G. Lueck, and J. Cownie, "PinPlay: A framework for deterministic replay and reproducible analysis of parallel programs," in Proceedings of the 8th Annual IEEE/ACM International Symposium on Code Generation and Optimization, 2010.
[54] L. Peled, S. Mannor, U. Weiser, and Y. Etsion, "Semantic locality and context-based prefetching using reinforcement learning," in Proceedings of the 42nd Annual International Symposium on Computer Architecture, 2015.
[55] Z. Purser, K. Sundaramoorthy, and E. Rotenberg, "A study of slipstream processors," in Proceedings of the 33rd Annual ACM/IEEE International Symposium on Microarchitecture, 2000.
[56] B. R. Rau and C. D. Glaeser, "Some scheduling techniques and an easily schedulable horizontal architecture for high performance scientific computing," in Proceedings of the 14th Annual Workshop on Microprogramming, 1981.
[57] A. Roth, A. Moshovos, and G. S. Sohi, "Dependence based prefetching for linked data structures," in Proceedings of the Eighth International Conference on Architectural Support for Programming Languages and Operating Systems, 1998.
[58] A. Roth and G. S. Sohi, "Effective jump-pointer prefetching for linked data structures," in Proceedings of the 26th Annual International Symposium on Computer Architecture, 1999.
[59] ——, "Speculative data-driven multithreading," in Proceedings of the Seventh International Symposium on High-Performance Computer Architecture, 2001.
[60] P. Rousseeuw, "Silhouettes: A graphical aid to the interpretation and validation of cluster analysis," Journal of Computational and Applied Mathematics, vol. 20, no. 1, Nov. 1987.
[61] F. J. Sanchez and A. Gonzalez, "Cache sensitive modulo scheduling," in Proceedings of the 30th Annual ACM/IEEE International Symposium on Microarchitecture, 1997.
[62] T. Sherwood, E. Perelman, G. Hamerly, and B. Calder, "Automatically characterizing large scale program behavior," in Proceedings of the 10th International Conference on Architectural Support for Programming Languages and Operating Systems, 2002.
[63] M. Shevgoor, S. Koladiya, R. Balasubramonian, C. Wilkerson, S. H. Pugsley, and Z. Chishti, "Efficiently prefetching complex address patterns," in Proceedings of the 48th International Symposium on Microarchitecture, 2015.
[64] A. Silberschatz, H. Korth, and S. Sudarshan, Database System Concepts, 6th ed. New York, NY, USA: McGraw-Hill, Inc., 2010.
[65] A. Sriraman, A. Dhanotia, and T. F. Wenisch, "SoftSKU: Optimizing server architectures for microservice diversity @scale," in Proceedings of the 46th International Symposium on Computer Architecture, 2019.
[66] X.-H. Sun and D. Wang, "Concurrent average memory access time," Computer, vol. 47, no. 5, May 2014.
[67] D. M. Tullsen and J. A. Brown, "Handling long-latency loads in a simultaneous multithreading processor," in Proceedings of the 34th Annual ACM/IEEE International Symposium on Microarchitecture, 2001.
[68] P. H. Wang, J. D. Collins, H. Wang, D. Kim, B. Greene, K.-M. Chan, A. B. Yunus, T. Sych, S. F. Moore, and J. P. Shen, "Helper threads via virtual multithreading on an experimental Itanium® 2 processor-based platform," in Proceedings of the 11th International Conference on Architectural Support for Programming Languages and Operating Systems, 2004.
[69] C. Wellons, "Raw Linux threads via system calls," https://nullprogram.com/blog/2015/05/15/, 2015, [Online; accessed 05-August-2019].
[70] H. Wu, K. Nathella, J. Pusdesris, D. Sunwoo, A. Jain, and C. Lin, "Temporal prefetching without the off-chip metadata," in Proceedings of the 52nd Annual IEEE/ACM International Symposium on Microarchitecture, 2019.
[71] A. Yasin, "A top-down method for performance analysis and counters architecture," 2014.
[72] X. Yu, C. J. Hughes, N. Satish, and S. Devadas, "IMP: Indirect memory prefetcher," in Proceedings of the 48th International Symposium on Microarchitecture, 2015.
[73] Y. Yuan, Y. Wang, R. Wang, and J. Huang, "HALO: Accelerating flow classification for scalable packet processing in NFV," in Proceedings of the 46th International Symposium on Computer Architecture, 2019.
[74] Y. Zhang, D. Meisner, J. Mars, and L. Tang, "Treadmill: Attributing the source of tail latency through precise load testing and statistical inference," in Proceedings of the 43rd International Symposium on Computer Architecture, 2016.
[75] C. Zilles and G. Sohi, "Execution-based prediction using speculative slices," in Proceedings of the 28th Annual International Symposium on Computer Architecture, 2001.