Observing the Invisible: Live Cache Inspection for High-Performance Embedded Systems
Dharmesh Tarapore, Shahin Roozkhosh, Steven Brzozowski, Renato Mancuso
11 ”This work has been submitted to the IEEE for possiblepublication. Copyright may be transferred without notice,after which this version may no longer be accessible.” a r X i v : . [ c s . OH ] J u l Observing the Invisible: Live Cache Inspectionfor High-Performance Embedded Systems
Dharmesh Tarapore ∗ , Shahin Roozkhosh ∗ , Steven Brzozowski ∗ and Renato Mancuso ∗∗ Boston University, USA { dharmesh, shahin, sbrz, rmancuso } @bu.edu Abstract —The vast majority of high-performance embedded systems implement multi-level CPU cache hierarchies. But the exactbehavior of these CPU caches has historically been opaque to system designers. Absent expensive hardware debuggers, anunderstanding of cache makeup remains tenuous at best. This enduring opacity further obscures the complex interplay amongapplications and OS-level components, particularly as they compete for the allocation of cache resources. Notwithstanding therelegation of cache comprehension to proxies such as static cache analysis, performance counter-based profiling, and cache hierarchysimulations, the underpinnings of cache structure and evolution continue to elude software-centric solutions.In this paper, we explore a novel method of studying cache contents and their evolution via snapshotting. Our method complementsextant approaches for cache profiling to better formulate, validate, and refine hypotheses on the behavior of modern caches. Weleverage cache introspection interfaces provided by vendors to perform live cache inspections without the need for external hardware.We present CacheFlow, a proof-of-concept Linux kernel module which snapshots cache contents on an NVIDIA Tegra TX1 SoC(system on chip).
Index Terms —cache, cache snapshotting, ramindex, cacheflow, cache debugging (cid:70)
NTRODUCTION
The burgeoning demand for high-performance embeddedsystems among a diverse range of applications such astelemetry, embedded machine vision, and vector process-ing has outstripped the capabilities of traditional micro-controllers. For manufacturers, this has engendered a dis-cernible shift to system-on-chip modules (SoCs). Coupledwith their extensibility and improved mean time betweenfailures, SoCs offer improved reliability and functional-ity. To bridge the gap between increasingly faster CPUspeeds and comparatively slower main memory technolo-gies (e.g. DRAM) most SoCs feature cache-based archi-tectures. Indeed, caches allow modern embedded CPUsto meet the performance requirements of emerging data-intensive workload. At the same time, the strong needfor predictability in embedded applications has renderedanalyses of caches and their contents vital for system design,validation, and certification.Unfortunately, this interest runs counter to the generaldesire to abstract complexity. Cache mechanisms and poli-cies are ensconced entirely in hardware, to eliminate soft-ware interference and encourage portability. Consequently,software-based techniques used to study caches suffer fromseveral shortcomings, which we detail in Section 5.5.In contrast, we propose CacheFlow: a technique thatcan be implemented in software and in existing high-performance SoCs to extract and analyze the contents ofcache memories. CacheFlow can be deployed in a livesystem without the need for an external hardware debugger.By periodically sampling the cache state, we show that wecan reconstruct the behavior of multiple applications in thesystem; observe the impact of scheduling policies; and studyhow multi-core contention affects the composition of cached content.Importantly, our technique is not meant to replace othercache analysis approaches. Rather, it seeks to supplementthem with insights on the exact behavior of applicationsand system components that were not previously possi-ble. While in this work we specifically focus on last-level(shared) cache analysis, the same technique can be used toprofile private cache levels, TLBs, and the internal states ofcoherence controllers. In summary, we make the followingcontributions:1) This is the first paper to describe in detail an inter-face, namely
RAMINDEX , available on modern embeddedCPUs that can be used to inspect the content of CPUcaches. Despite its usefulness, the interface has receivedlittle to no attention in the research community thus far;2) We present a technique called CacheFlow to performcache content analysis via event-driven snapshotting;3) We demonstrate that the technique can be implementedon modern hardware by leveraging the
RAMINDEX inter-face and propose a proof-of-concept open-source Linuximplementation
4) We describe how to correlate information retrieved viacache snapshotting to user-level and kernel softwarecomponents deployed on the system under analysis;5) We evaluate some of the insights provided by the pro-posed technique using real and synthetic benchmarks.The rest of this paper is organized as follows: in Sec-tion II, we document related research that inspired andinformed this paper. In Sections III and IV, we provide abird’s-eye view of the concepts necessary to understand themechanics of CacheFlow. In Sections V and VI, we detail the
1. Our implementation can be found at: https://github.com/weirdindiankid/cacheflow fundamentals of CacheFlow and its implementation. SectionVII outlines the experiments we performed and documentstheir results. We also examine the implications of thoseresults. Section VIII concludes the paper with an outlookon future research.
ELATED W ORK
Caches have a significant impact on the temporal behav-ior of embedded applications. But their design—orientedtoward programming transparency and average-case opti-mization—makes performance impact analysis difficult. Aplethora of techniques have approached cache analysis frommultiple angles. We hereby provide a brief overview of theresearch in this space.
Static Cache Analysis derives bounds on the accesstime of memory operations when caches are present [1]–[3]. Works in this domain study the set of possible cachestates in the control-flow graph (CFG) of applications. Ab-stract interpretation is widely employed for static cacheanalysis, as first proposed in [4] and [5]. For static analysisto be carried out, a precise model of the cache behavioris required. Techniques that consider Least-Recently Used(LRU), Pseudo-LRU, and FIFO replacement policies [3], [6]–[9] have been studied.
Symbolic Execution is a software technique for feasiblepath exploration and WCET analysis [10], [11] of a programsubject to variable input vectors. It proffers a middle groundbetween simulation and static analysis. An interpreter fol-lows the program; if the execution path depends on anunknown, a new symbolic executor is forked. Each symbolicexecutor stands for many actual program runs whose actualvalues satisfy the path condition.As systems grow more complex,
Cache Simulation toolsare essential to study new designs and evaluate existingones. Simulation of the entire processor — including cores,cache hierarchy, and on-chip interconnect — was proposedin [12]–[15]. Simulators that only focus on the cache hierar-chy were studied in [16]–[18]. Depending on the componentunder analysis, simulations abound. In the (i) execution-driven approach, the program to be traced runs locally onthe host platform; in the (ii) emulation-driven approach, thetarget program runs on an emulated platform and environ-ment created by the host; finally, in the (iii) trace-drivenapproach, a trace file generated by the target applicationis fed into to the simulator. An excellent survey reviewing28 CPU cache simulators was published by Brais et. al [19].The most popular is perhaps Cachegrind that belongs to theValgrind Suite [20].
Statistic Profiling is performed by leveraging perfor-mance monitoring units (PMUs) integrated in modern pro-cessors. PMUs can monitor a multitude of hardware eventsthat occur as applications execute on the platform. Unlikethe aforementioned strategies, sampling the PMU providesinformation on the real behavior of the hardware platform.As such, a number of works have used statistic profiling tostudy memory-related performance issues [21]–[23]. Highlevel libraries such as PAPI [24], [25], Likwid [26] andnumap [27] provide a set of APIs to ease the use of PMUs.Despite the seminal results achieved in the last decadein the cache analysis and profiling techniques described thus far, a few important limitations are worth noting.Techniques that rely on cache models — i.e. static anal-ysis, symbolic execution, simulation — work under theassumption that the employed models accurately representthe true behavior of the hardware. Unfortunately, complexmodern hardware often deviates from textbook models inunpredictable ways. Access to event counters subsequentlyonly reveals partial information on the actual state of thecache hierarchy.The technique proposed in this paper is meant to com-plement the analysis and profiling strategies reviewed thusfar. It does so by allowing system designers to snapshot theactual content of CPU caches. This, in turn, enables a newset of strategies to extract/validate hardware models, or toconduct application and system-level analysis on the utiliza-tion of cache resources. Unlike works that proposed cachesnapshotting by means of hardware modifications [28]–[30]our technique can be entirely implemented in software andleverages hardware support that already exists in a broadline of high-performance embedded CPUs.
ACKGROUND
In this section, we introduce a few fundamental conceptsrequired to understand CacheFlow’s inner workings. First,we review the structure and functioning of multi-level set-associative caches. Next, we briefly explore the organizationof virtual memory in the target class of SoCs.
Modern high-performance embedded processors imple-ment multiple levels of caching. The first level (L1) is theclosest to the CPU and its contents are usually private tothe local processor. Cache misses in L1 trigger a look-up inthe next cache level (L2), which can be still private, sharedamong a subset (cluster) of cores, or globally shared by allthe cores. Additional levels (L3, L4, etc.) may also be present.The last act before a look-up in the main memory is toquery the
Last-Level Cache (LLC). Without loss of generality,and to be more in line with our implementation, we considerthe typical cache layout of ARM-based processors. That is,we assume private L1 caches and a globally shared L2,which is also the LLC.
Set-associativity:
Caching at any level typically followsa set-associative scheme. A set-associative cache with associa-tivity W , features W ways, where each way has an identicalstructure. A cache with total size C S is thus structured in W ways of size W S = C S /W each. Caches store multipleblocks of consecutive bytes in cache lines (a.k.a. cache blocks ).We use L S to indicate the number of bytes in each cache line.Typical line sizes are 32 or 64 bytes. S = W S /L S denotes thenumber of lines in a way or sets in a cache.When the CPU accesses a cacheable memory location,the memory address determines how the cache look-up (orallocation) is performed. In the case of a cache miss, theleast-significant bits of the address encode the specific byteinside the cache line and are called offset bits. For instance,in systems with L S = 64 bytes, the offset bits are [5:0].The second group of bits in the memory address encodesthe specific cache set in which the memory content can be cached. Since we have S possible sets, the next log S bitsafter the offset bits select one of the possible sets and arecalled index bits. Finally, the remaining bits, called tag bits,are stored alongside cached content to detect cache hits afterlook-up. Virtual and Physical Caches:
Addresses used for cachelook-ups can be physical or virtual. In the vast majority ofembedded multi-core systems, tag bits are physical addressbits signifying a physically-tagged cache. In shared caches(e.g. L2), index bits are also derived from physical ad-dresses. For this reason, they are said to be physically-indexed , physically-tagged (PIPT) caches [31]–[35]. Virtual Memory
Modern computing architectures rely onsoftware and hardware support to map virtual addressesused by processes to physical addresses represented in thehardware. Giving processes distinct views of the system’smemory obviates problems stemming from fragmentationor limited physical memory. The operating system man-ages the translation between virtual and physical addressesthrough a construct known as the page table . When a processtries to reference a virtual address, the operating system firstchecks for that address’ presence in the referencing process’address space. If present, the system then checks for thepage’s presence in memory via its page table entry (PTE).From there, the address is either resolved to a physicaladdress, or the page is brought into main memory and thenresolved.
Virtual Memory Areas : When a process terminates inLinux, the kernel is tasked with freeing the process’ mem-ory mappings. Older versions of Linux accomplished thisby iterating through a chain of reverse pointers to pagetables that referenced each physical frame. The mountingintractability of this approach spurred the development of virtual memory areas , or VMAs.VMAs are contiguous regions of virtual memory rep-resented as a range of start and end addresses. They sim-plify the kernel’s management of a process’ address space,thus facilitating granular control permissions on a per-VMAbasis. VMAs record frame to page mappings on a per-VMA basis (as opposed to on a per-frame basis) and wereincorporated into the kernel, as of version 2.6 [36]. HE RAMINDEX I NTERFACE
The
RAMINDEX interface was originally introduced on theARM Cortex-A15 [34] family of CPUs and is currently avail-able on the high-performance line of ARM embedded pro-cessors. These include ARM Cortex A15 [34], ARM CortexA57 [32], ARM Cortex A76 [37], and the recently announcedARM Cortex A77 [38]. Table 1 reviews the availability ofthe
RAMINDEX interface across ARM CPUs and provides afew notable examples of known SoCs equipped with suchCPUs. Given the consistent support for the
RAMINDEX in-terface across the high-performance line of ARM processors,there is good indication that
RAMINDEX will continue to besupported in future families of CPUs.We now make specific reference to the
RAMINDEX inter-face available on ARM Cortex-A57 CPUs and used in our
TABLE 1Availability of
RAMINDEX in ARM Cortex-A CPUs.
CPU Max. Freq. Release Privilege Level Notable SoCs
Cortex-A9 (32-bit) 800 MHz 2007 Not supported Xilinx Zynq-7000Nvidia Tegra, 2, 3, 4iCortex-A15 (32-bit) 2.5 GHz 2011 EL1 or Higher Nvidia Tegra 4, K1MediaTek MT8135/VCortex-A53 (64-bit) 1.5 GHz 2012 Not supported Intel Stratix 10Xilinx ZynqMPCortex-A57 (64-bit) 1.6 GHz 2012 EL1 or Higher Nvidia Tegra X1Nvidia Tegra X2Cortex-A72 (62-bit) 2.5 GHz 2015 EL1 or Higher MediaTek Helio X2x, MT817xXilinx VersalCortex-A76 (64-bit) 3 GHz 2018 EL3 MediaTek Helio G90Cortex-A77 (64-bit) 3.3 GHz 2019 EL3 MediaTek Dimensity 1000 experiments. These CPUs belong to the ARMv8-A archi-tecture and support two main modes of operations: 64-bitmode (AArch64) and 32-bit mode (AArch32). We base ourdiscussion on target machines operating in AArch64 mode.
RAMINDEX operates on a set of 10 32-bit wide systemregisters which are local to each of the processing cores.These registers are read via the move from system register ( mrs ) instruction. Similarly, write operations can be per-formed using the move to system register ( msr ) instruction.The main interface is the eponymous RAMINDEX register.In a nutshell, the CPU writes a command to the
RAMINDEX register by specifying (1) the target memory to be accessed,and (2) appropriate coordinates to access the target memory.The result of the command is then available in the remaining9 registers and can be read. Of these, the first group of regis-ters named
IL1DATA n _EL1 , with n ∈ { , . . . , } holds theresult of any operation that accesses the instruction L1 cache.The second group of registers, namely DL1DATA n _EL1 with n ∈ { , . . . , } , holds the result of accesses performed toL1 or L2 cache data entries. The resources that can beaccessed through the RAMINDEX interface (see [32] for exactdetails of the valid commands) include (1) L1 instructionand data cache content, both tag and data memories; (2)L1 instruction cache branch predictor and indirect jumppredictor memories ; (3) L1 instructions and data translationlook-aside buffers (TLBs); (4) L2 tag and data memories; (5)L2 TLB; (6) L2 Error Correction Code (ECC) memory; and(7) L2 snoop-control memory yielding information aboutthe Modified/Owned/Exclusive/Shared/Invalid (MOESI)state of each cache line.As can be noted, the coverage of the RAMINDEX interfaceis quite extensive. In this paper we decided to focus specif-ically on the L2 cache which is shared among all the cores.Because we are specifically interested in the relationshipbetween cache state and memory owned by applications,we program the
RAMINDEX interface to access the L2 tagmemory. The coordinates to retrieve the content of a specificentry are expressed in terms of (1) cache way number, and(2) cache set number to be accessed. Because the L2 is aPIPT cache, the set number corresponds to the index bits ofa physical address. The content of the tag memory returnedon the
DL1DATA n _EL1 registers contains the remaining bitsof the physical address cached at the specified coordinates,and an indication that the entry is indeed valid.
2. These are three different memory resources, i.e. the indirect pre-dictor memory, the Branch Target Buffer (BTB) and the Global HostoryBuffer (GHB).
Fig. 1. Trigger and Shutter modules operation over time in synchronous(a) and asynchronous (b) mode with periodic sampling.
Among the four privilege levels E0-E3 of ARMv8(-A), onARM Cortex-A57 and Cortex-A72 CPUs,
RAMINDEX is avail-able at all exception levels, excluding EL0, i.e. the lowestprivilege level (see Table 1). This has important implications.For example,
RAMINDEX could be used to expose memoryand cache contents of other guest operating systems sharingthe same hardware. For this reason, while in ARM Cortex-A57, the
RAMINDEX is accessible starting from EL1, it ap-pears that it is gradually being elevated in the privilegelevel. For instance, in Cortex-A77, EL3 (Trust-Zone securitymonitor level) is required.
ACHE F LOW O VERVIEW
In this section we discuss the general workflow of theproposed CacheFlow technique. We first provide a high-level description of the different moving parts; we thendescribe the main challenges faced and the possible usagescenarios for the proposed CacheFlow.CacheFlow is structured in two modular components.The first module, namely the
Shutter concerns the low-levellogic that leverages the
RAMINDEX interface described inSection 4. The Shutter is responsible for initiating a snapshotof the content of the target memory — e.g. L1 data/tag,L2 data/tag, TLBs, etc. The Shutter is implemented as anOS component and hence runs with kernel-level privileges.It exposes two interfaces to the rest of the system: (1) aninterface to configure and initiate snapshot acquisition; and(2) a data channel where the content of acquired snapshotsis passed to user-space for final processing.The second module is the
Trigger that implements user-level logic to commandeer a new snapshot to be performedby the Shutter. The Trigger is designed to support a numberof event-based activation strategies. For instance, to performperiodic sampling of the cache content, the Trigger is acti-vated via timer events. Alternatively the Trigger can be acti-vated when the application under analysis reaches a code ordata breakpoint, invokes a system call, or delivers a specificPOSIX signal to the Trigger process. Periodic activation, andevent-driven activation initiated through signal delivery arecurrently implemented.
CacheFlow can operate in a number of modes to facilitatedifferent types of cache analyses. A mode is selected byappropriately configuring the Shutter and Trigger modules.A more in-depth discussion is provided in Section 6. Here,we provide a short overview of the most important modes.First, CacheFlow can operate in flush or transparent mode.When operating in flush mode, the acquisition of a snap-shot is intentionally destructive to cache contents. In thismode, after acquiring a snapshot, the cache contents of theapplication(s) under analysis are flushed from the cache.Conversely, when operating in transparent mode, cachesnapshotting is performed while minimizing the impacton the contents of the cache. We quantify the involuntarypollution overhead when operating in transparent mode inSection 7.3. CacheFlow can also operate in synchronous or asynchronous mode. Synchronous mode is best suited to analyzing a spe-cific subset of applications executing in parallel on multiplecores. In this mode, the Trigger spawns the applicationsunder analysis and delivers POSIX signals to pause themonce a new snapshot acquisition is initiated, and to resumethem afterwards. Figure 1 (a) provides a timeline of eventsas they occur in the synchronous mode. When a new snap-shot is to be acquired (dashed blue up-arrow in the figure),all the tasks under analysis are paused (dashed red down-arrows). Once the acquisition of the current snapshot iscomplete, all the observed applications are resumed (dashedblue down-arrows). This mode ensures that all the observedapplications — including the one executing on the sameCPU as the Trigger— are equally affected by the activationof the trigger. The extra complexity of pause/resume sig-nals is unnecessary (1) when a single application is beingobserved, pinned to the same core as the Trigger; and (2)when one is not conducting an analysis on a specific set ofapplications, but, for instance, on the the background noise ofsystem services.To cover the latter two cases, CacheFlow can operatein asynchronous mode. In this case, there is no explicitpause/resume signal delivery to applications, as depicted inFigure 1 (b). Upon activation (dashed up-arrow), the Triggerpreempts all the applications on the same core but does notexplicitly pause applications on other cores. It then invokesthe Shutter.Regardless of the mode, note that the Shutter temporar-ily spins all the cores (solid down-arrow) once invoked toperform the low-level interaction with RAMINDEX registers.This is necessary to ensure the correctness of the snapshot,as discussed in Section 6.3 and highlighted in Figure 1.
Three key challenges have been addressed in the proposeddesign of CacheFlow, which are hereby summarized. Moredetails on how each challenge was solved are provided inSection 6.
Avoiding Pollution:
The first challenge we faced is quiteintuitive. Acquiring a snapshot of the cache involves the execution of logic on the very same system we are tryingto observe. Worse yet, while the content of the cache isprogressively read, one must use a memory buffer to storethe resulting data. But writes into the buffer might triggercache allocations and hence pollute the state of the cachethat is being sampled. Because the size of the used bufferneeds to be in the same order of magnitude as the size ofthe cache, this issue can significantly impact the validity ofthe snapshot.To solve this challenge, we statically reserve a portionof main memory used by CacheFlow to acquire a snapshot.The memory region reserved for CacheFlow is marked asnon-cacheable. For CacheFlow operating in flush mode, it isonly necessary to reserve enough memory for a single snap-shot. This space is reused for subsequent snapshots. Con-versely, to operate in transparent mode, enough memory tostore all the snapshots required for the current experimentneeds to be reserved. We refer the reader to Section 6 formore details on the size of each snapshot. With this setup,the Shutter minimizes snapshot pollution by performing aloop of a few instructions that (1) iterate through all thecache ways and sets to be read; (2) operate exclusively onCPU registers; and (3) only perform memory stores towardthe non-cacheable buffer.
Pausing Progress:
Capturing a snapshot can take a non-negligible amount of time. While a snapshot capture is inprogress, it is important to ensure that the applicationsunder analysis do not progress in their execution. In otherwords, the Shutter should be able to temporarily freeze allthe running applications and resume their execution oncethe capture operation is complete. Not doing so wouldresult in snapshots that do not reflect a real cache state.This is because the state of the cache would be continuouslychanging while the capture is still in progress.On a single-core implementation, it is enough to run theShutter with interrupts and preemption disabled to ensurethat the application under analysis does not continue toexecute while a capture operation is in progress. But thisis not sufficient in a multi-core implementation. To solve theproblem, we first designate a master core responsible forcompleting the capture operation. Next, we use kernel-levelinter-core locking primitives to temporarily stall all the othercores. Once the snapshot has been acquired, all the othercores are released and resume normal execution.
Inferring Content Ownership:
As recalled in Section III,shared caches — like the L2 targeted in our implementation— are generally PIPT. As such, when a snapshot is captured,we obtain a list of physical address tags. A first importantstep consists in reconstructing the full physical addressgiven the obtained tag bits and the cache index bits usedto retrieve each tag. The end goal of our analysis, however,is to attribute captured cache lines to running applicationsor OS-level components. This step is strictly dependent inthe strategy used by the OS/platform under analysis tomap applications’ virtual addresses to physical memory. Wedistinguish three cases.The first and simplest case corresponds to (RT)OS’s op-erating on small micro-controller that do not have supportfor virtual memory, i.e. where no MMU is present. Thesesystems usually feature a Memory Protection Unit (MPU)that allows defining permission regions for ranges of phys- ical addresses. Both applications and OS components arethen directly compiled against physical memory addresses.In this case, ownership of cache blocks can be inferred bysimply comparing the obtained physical addresses w.r.t. theglobal system memory map.The second case corresponds to systems where, albeitan MMU exists, it is configured to perform a flat linearmapping between virtual addresses and physical addresses.In this case there exists a (potentially null) constant offsetbetween virtual addresses and corresponding physical ad-dresses.The third scenario corresponds to OS’s that use demandpaging. In this case, there is no fixed mapping betweenvirtual pages assigned to applications and physical mem-ory. In this case, contiguous pages in virtual memory arearbitrarily mapped to physical memory, following the OS’sinternal memory allocation scheme. With demand paging,applications are initially given a virtual addressing space.Only when the application “touches” a virtual page, a newphysical page is allocated from a pool of free pages. InCacheFlow we consider this case because it represents themost general and challenging scenario.
We envision that, in addition to the use cases directly ex-plored in this paper, CacheFlow and variations of our tech-nique can be employed in a number of use cases, includingbut not limited to the following: (1) To study the heat-mapof cached content over time and understand the evolutionof the active memory working set of user applications. (2)To study the cache footprint of specific functions within anapplication, a library or at the level of system calls. (3) In asingle-core setup with multiple co-running applications, tostudy how scheduling decisions and preemptions affect thecomposition of cache content over time. (4) In a multi-coresetup with multiple tasks running in parallel to study thecontention over cache resources. (5) To debug and validateset-based (e.g. coloring) and way-based cache partitioningtechniques. (6) To validate hypotheses on the expectedbehavior of a multi-level cache hierarchy, e.g., in termsof inclusiveness, replacement policy, employed coherencyprotocol, prefetching strategies. (7) To asses the vulnerabilityof a machine to cache-based side channel attacks arisingfrom speculative execution and memory traffic.
CacheFlow offers a novel method to study caches, wheretraditionally hardware debuggers and simulation modelshave been used. System designers have traditionally re-sorted to hardware debuggers to inspect the contents ofcache memories, using them as a proxy to study correctsystem behavior and explain applications’ performance. Themain advantage of using an external hardware debuggerto inspect the state of caches is that the impact of thedebugger on the cache itself can be kept to a minimum.But making sense out of a cache snapshot requires accessto OS-level data structures such as page tables and VMAlayouts, to name a few. Debuggers that provide some ofthe cache analysis features provided by CacheFlow rely onhigh-bandwidth trace ports — as opposed to traditional
JTAG ports — often unavailable in production systems.The Lauterbach PowerTrace II and the ARM DS-5 withthe DSTREAM adapter are examples of solutions that canprovide snapshots of cache contents. Their price tag exceedsUSD 6,000 . While in principle an inexpensive JTAG de-bugger could be used to halt the CPUs, interact with the RAMINDEX interface, perform physical → virtual translationand application layout resolution, no such implementationexists to the best of our knowledge.In contrast, CacheFlow runs entirely in software, im-poses minimal system overhead, does not requires the ex-istence of a debug port nor extra hardware, and can runon most machines with support for the RAMINDEX interface,while managing to provide much of the same information ashardware debuggers with minimal effort. CacheFlow’s mostobvious shortcoming is some inevitable overhead, in termsof cache pollution, compared to hardware debuggers .Another method often used to perform cache profiling issimulation. Unfortunately, simulation models are often toogeneric to capture implementation-specific design choices.Gem5 [12], for example, only simulates a generic cachemodel, which may not be in match with the behavior of theactual hardware. It is also challenging to simulate entire sys-tems with production-like setups in terms of active systemservices, active I/O devices and concurrent applications.Conversely, CacheFlow can be used to observe the behaviorof a system in the field, and/or to validate and refineplatform-specific cache simulation models.Yet another class of cache analysis approaches arebased on performance-counter sampling. These only pro-vide quantitative information on system-wide metrics thatare best interpreted with a good understanding of the micro-architecture at hand. In comparison, CacheFlow, providesbehavioral information about the cache that is akin to whata hardware debugger could provide.An additional benefit CacheFlow offers compared to theaforementioned approaches is its relative versatility aproposdeployability. Since it relies exclusively on RAMINDEX andLinux’s scaffolding for building and loading kernel mod-ules, CacheFlow serves as an excellent candidate for remotedeployment. On virtual private servers (VPS), for instance,CacheFlow can provide information that developers wouldtraditionally rely on debuggers for, without necessitatingphysical access to the system. As such, CacheFlow’s valuelies primarily in its simplicity and its reliance on ubiqui-tous support structures, both of which then engender anacceptable compromise between the effort needed to setupa hardware debugger and the loss of granularity incurredwhen using simulators.
MPLEMENTATION
Additional details about our CacheFlow implementationfollow below. We begin by describing the relevant featuresof our target SoC and then illustrate the workings of theTrigger and Shutter modules. An open-source version ofCacheFlow is available at:
URL omitted for blind submission.
Fig. 2. Logical interplay between the modules of CacheFlow and se-quence of operations performed to capture a cache snapshot.
We conducted our experiments on an NVIDIA Tegra X1SoC [39]. The TX1 chip features a cluster of four ARMCortex A57 [32] CPUs operating at a frequency of 1.9 GHz,along with four unused ARM Cortex-A53 cores. Each CPUcontains a private 48 KB L1 instruction cache and a 32 KBL1 data cache. The L2—also the LLC—is unified and sharedamong all the cores. It is implemented as a PIPT cache andemploys a random replacement policy.In terms of geometry, the L2 cache is C S = L S =
64 bytes and the associativity is W =
16. It follows, then, that the cache is divided into 2048 cachesets—each containing 16 cache lines, which in turn contain64 bytes of data each. Bits [0, 5] of a physical address markthe offset bits; bits [6, 16] are the index bits; bits [17, 43] correspond to the tag bits.The information acquired for each cache line in ourimplementation is limited to 16 bytes. Of these, 8 bytesare for the PID of the process that owns the line, andthe remaining 8 bytes encode an address field. If addressresolution is turned on, the field holds the resolved virtualaddress. If address resolution is turned off, the field is usedto store the raw physical address instead. For a 16-way set-associative cache with a line size of 64 bytes and total sizeof 2 MB, like the one considered for our evaluation, a singlesnapshot is 512 KB in size. In our setup, we have dedicated1 GB of memory for CacheFlow, meaning that for any givenexperiment, up to 2048 snapshots can be collected. With atypical snapshot period of 5-10ms, this allows studying thebehavior of applications with a runtime of 10-20 seconds. Starting from the top-level module of CacheFlow, i.e. theTrigger, we hereby review the proposed implementationfollowing the logic flow of operations provided in Figure 2.The Trigger module is always executed with the highestreal-time priority. The module is designed for event-basedactivation and hence a new snapshot is initiated when a newevent activates the Trigger, as shown in Figure 2 . In ourcurrent prototype, we implemented two activation modes:(1) periodic activation with a configurable inter-activation
5. The platform supports a 44-bit physical address space. time; and (2) event-based. In periodic mode, events aregenerated using a real-time timer set to deliver a
SIGRTMAX signal to the Trigger. The tasks under analysis are launcheddirectly by the Trigger to ensure in-phase snapshotting, andto allow the Trigger to set a specific scheduler and priorityon the spawned children processes.The Trigger implements event-based activation by ini-tiating a snapshot upon receipt of a
SIGRTMAX-1 signal.In synchronous mode, a task under analysis spawned bythe Trigger can initiate a snapshot with a combination of getppid and kill system calls. This mode was used inSection 7.7 to study the properties of the cache replacementpolicy in the target platform. The limitation of this approachis that some instrumentation in the applications’ code isrequired. Future extensions will leverage the ptrace familyof operations to allow attaching to unmodified applications,and to trigger a new snapshot upon reaching an instructionor data breakpoint.The next step depends on the mode (synchronous vs.asynchronous) in which the Trigger is configured to run.In the synchronous mode, the trigger immediately stops allthe observed tasks with a
SIGSTOP — see Figure 2 . Thisis particularly useful if multiple cores are active, but intro-duces unnecessary additional overhead when performingsingle-core analysis. Hence, this is an optional step andskipped when the Trigger operates in asynchronous mode.Next, the Trigger commandeers the acquisition of a newsnapshot to the Shutter module via a proc filesystem inter-face, as depicted in Figure 2 . If the trigger is operating inflush mode, the binary content of the snapshot is immedi-ately copied to a user-space buffer via the same interface,as shown in Figure 2 . At this point, the trigger mightcollect in-line statistics on the content of the snapshot, or(optionally) (see Figure 2 ) render the snapshot in human-readable format and store it in persistent memory for lateranalysis. This step is deferred to the end of the experimentfor all the collected snapshots if CacheFlow operates intransparent mode.Typical embedded applications make limited use ofdynamic memory and dynamically linked libraries, whichare instead commonly used features in general-purposeapplications. It follows that the virtual memory layout — i.e.the list of VMAs — of embedded applications is generallystatic. It is easy to infer to which VMA a given virtualaddress belongs to in applications with static memory lay-out. Conversely, dynamically linking libraries and allocat-ing/freeing memory at runtime can substantially change theVMA layout of an application. For this reason, the Triggeroptionally allows recording the current memory layout ofan application at the time a cache snapshot is acquired. Thisis done by reading the /proc/PID/maps interface, where PID is the process id of the application under analysis.Finally, if the Trigger is operating in synchronous mode,a
SIGCONT is sent to all the tasks under analysis as shownin Figure 2 . The Shutter is implemented as a Linux kernel module.At startup, it establishes a communication channel withuser-space by creating a new entry in the proc pseudo- filesystem. Configuration parameters and snapshot acquisi-tion commands are sent to the Shutter via ioctl calls. IfCacheFlow is operating in transparent mode, this interfaceis also used to specify which snapshot to retrieve at the con-clusion of the experiment. A read system call (Figure 2 )can be used to retrieve the selected snapshot in user-space. Inter-CPU Synchronization:
As discussed in Section 5,being able to perform a full capture of L2’s content whileminimizing pollution is of the utmost importance. For thisreason, the Shutter delivers an Inter-Processor Interrupt (IPI)to all the other CPUs, forcing them to enter a busy-waitingloop — see Figure 2 . Specifically, right before broadcastingthe IPI, the Shutter acquires a spinlock. Next, the payloadof the delivered IPI makes all the other CPUs wait on thespinlock. Pollution-free Retrieval:
While holding the spinlock,the Shutter proceeds to use the aforementioned
RAMINDEX (see Section IV) interface to access the contents of the L2tag memory, as shown in Figure 2 . Entry by entry, akernel-side buffer is filled with the result of the RAMINDEX operations — as per Figure 2 . To avoid polluting the L2 atthis step, the buffer is allocated as a non-cacheable memoryregion. To do so, we use the boot-time parameter mem torestrict the amount of main memory seen (and used) byLinux to carve out a large-enough physically-contiguousmemory buffer. This area is then mapped by the Shutter us-ing the ioremap_nocache kernel API and used to receivethe content of the L2 tag entries. From Physical to Virtual:
After all the L2 tag entrieshave been retrieved, the buffer contains a collection of phys-ical addresses, one per each cache block that was marked as valid in L2. Recall from Section 5.3 that operating systemsthat have full support for MMUs define a non-linear map-ping between virtual pages and physical memory such thata set of contiguous virtual pages is comprised of arbitrarilyscattered physical pages. Therefore, while the conversionvirtual → physical can be easily performed using page-tablewalks, the reverse translation is non-trivial. The physicaladdress resolution step depicted in Figure 2 refers to sucha reverse translation performed on each of the retrieved L2entries.To perform this resolution, we leverage Linux’s specificrepresentation of memory pages. Linux defines a descriptorof type struct page for each of the physical memorypages available in the system. The conversion fom phys-ical address to page descriptor is possible through the phys_to_page kernel macro. We first derive the pagedescriptor of the physical address to be resolved. Next,we effectively re-purpose the reverse-map interface ( rmap )used by Linux to efficiently free physical memory whenswapping is initiated. The entry point of the interface isthe rmap_walk kernel API. Given a target struct page descriptor, the procedure allows one to specify a callbackfunction to be invoked when a possible candidate for thereverse translation is found . A successful rmap_walk op-eration returns (i) a reference to the VMA that maps thepage; and (ii) the virtual address of the page inside the
6. See https://lwn.net/Articles/75198/ for more details.7. Because multiple processes might be mapping the same physicalmemory, the reverse translation is not always unique.
VMA. Importantly, the reference to the VMA allows one toderive the original memory space ( struct mm_struct );and from there, the descriptor of the process ( structtask_struct ) associated to the memory space and itsPID. After the translation step, the entries in the buffer areconverted to contain two pieces of information: (i) the PIDof the process to which the cache block belongs, and (ii)the virtual address of the block within the process’ virtualmemory space.The virtual → physical translation can be optionally dis-abled. This is useful, for instance, when profiling the cachebehavior of an application pinned to a specific subset ofphysical pages, of a different virtual machine, or of thekernel itself. From Kernel to User:
Lastly, the contents of the kernel-side buffer are copied into a user-space buffer defined inthe Trigger. On this very last step, a distinction needs tobe made because the behavior of CacheFlow significantlydiffers when it operates in flush mode, compared to trans-parent mode operation.In flush mode, the goal is to analyze what cache blocksare actively loaded by an application in between snapshots.For this reason, every snapshot acquisition is immediatelyfollowed by a copy of the snapshot to the Trigger in user-space . The Trigger also converts the binary format of thesnapshot to human-readable format and writes it to disk.This step corresponds to Figure 2 . The copy to user-space,as well as any post-processing performed by the Trigger,is conducted in cacheable memory. Because the amount ofdata moved after each snapshot is comparable in size to theL2, the post-processing acts as a tacit flush operation. How-ever, to cope with random cache replacement policies, theTrigger performs additional cache trashing before resumingthe applications to ensure that the content of the cache isindeed flushed. The presence of this flush operation is vitalto the correct interpretability of the results. By doing so,each snapshot contains only cache blocks allocated duringthe last sampling period. Therefore, the extracted content isrepresentative of the recent activity of the applications andenables active cache working-set analysis.In transparent mode, the goal is to analyze the evolutionof cache content over time while minimizing the impactof CacheFlow on the cache state. In this mode, no post-snapshot flush is performed. Thus, subsequent snapshotsare accumulated in non-cacheable memory. They are thenmoved to user-space and post-processed only at the veryend of the experiment. We evaluate in Section 7.3 how muchpollution is introduced in the various modes of operation.Because the size of a snapshot to be transferred to user-space is on the order of hundreds of pages, we use thesequential file ( seq_file ) kernel interface . This interfacesafely handles proc filesystem outputs spanning multiplememory pages. VALUATION
This section aims to demonstrate the capabilities of theproposed and implemented CacheFlow technique. This is
8. Note that it is still important to prevent cache pollution while thecurrent snapshot is being acquired. Thus, pollution-free retrieval is stillcrucial.9. See https://lwn.net/Articles/22355/ for more details. not meant to be an exhaustive evaluation of all the scenariosin which CacheFlow might be employed, but rather todemonstrate that CacheFlow is capable of producing usefulinsights on the cache usage of real applications in a realsystem.
All the experiments described in this section have beencarried out on an NVIDIA Jetson TX1 development systemrunning Linux v4.14. The Jetson TX1 features an NVIDIATegra X1 SoC, in line with what is described in Section 6.For our workload, we use a combination of syntheticand real benchmarks. Additional details about the syntheticbenchmarks we designed are provided contextually to theexperiment in which they are employed. For our real bench-marks, we considered applications from the the San DiegoVision Benchmarks (SD-VBS) [40], which come with multi-ple input sizes. Our goal is to demonstrate the usefulnessof CacheFlow in analyzing an application’s cache behav-ior. As such, we include only a selection of the obtainedresults covering the most interesting cases. We selectedthe D
ISPARITY , M
SER , S
IFT , and T
RACK benchmarks withintermediate input sizes, namely VGA (640x480) and CIF(352x288) images.The remainder of this section is organized to address thefollowing questions:1) Is CacheFlow able to provide an output that is repre-sentative of the actual cache behavior of an application?This is covered in Section 7.2.2) What is the overhead in terms of cache content pollu-tion and time? We discuss this aspect in Section 7.3.3) Is it possible to track the cache behavior of real applica-tions in term of WSS and frequently accessed memorylocations using CacheFlow? We tackle this question inSection 7.44) Can CacheFlow reveal system-level properties, suchas (1) how the cache is being shared by concurrentapplications; (2) how scheduling decisions impact cacheusage? Section 7.5 approaches this question.5) Is it possible to use CacheFlow to predict whether anapplication will suffer measurable cache interferencefrom a co-running application? This is explored inSection 7.6.6) Can we study the replacement policy and implementedby the target platform and its statistical characteristics?This analysis is conducted in Section 7.7.
The first aspect to validate is whether or not CacheFlowis able to provide an output that can be trusted , in thesense that it is meaningfully related to the behavior of theapplication under analysis. We study the output producedby CacheFlow on a synthetic benchmark. The benchmarkallocates two buffers of 512 KB each. It then performs a fullwrite followed by a full read on the first buffer. Next, itperforms a full write followed by a full read on the secondbuffer.We set our trigger to operate in periodic, synchronousmode, with an interval of 2 milliseconds between snapshots.We then plot a heat-map of the number of cache lines +1+2[vvar] +4+8+12+16+20+24+28libc +0[text] 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22Snapshot Fig. 3. S
YNTH memory heat-map analysis. Flush mode. +31+32[stack] +1+2+3+4libc +1+2+3+4libc 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22Snapshot
Fig. 4. S
YNTH memory heat-map analysis. Transparent mode. found in the snapshot for each of the pages that belong tothe benchmark under analysis. We perform the experimenttwice, once in flush mode and then in transparent mode.The results of the two experiments are depicted in Figure 3and 4.In Figure 3 we depict the heat-map for the 4 mostused memory regions — as determined by reading thecontent of the snapshots. We only depict the most used(and interesting) region in Figure 4. The color scale weused indicates the number of lines present per 4 KB page ineach snapshot, with darker blue tones indicating fewer linesand tones shifting towards yellow signifying more lines.The progression of the snapshots is plotted on the x axis.Page addresses are then plotted on the y axis in terms ofrelative offset (in pages) from the beginning of the region.We annotate the name of the region on the left-hand side ofthe plot.A number of important observations can be made. Wesummarize the most interesting ones. First, we notice that atthe beginning of the application, the pages of the stack, thetext and glibc library are mostly present in cache. Thisis in line with the initialization of the application whichuses malloc to allocate its buffer. For small allocations, malloc expands the heap of the calling process. But if theallocation of larger buffers is requested, the glibc resortsto performing a mmap call instead to create a new memoryregion not backed by a file (anonymous memory). It is inthis anonymous region, marked as “[anon]” in the plot,that the bulk of memory accesses will be performed by ourbenchmark. We can also notice that the region comprises 256pages, i.e. 1 MB.Let us now focus on this region. In Figure 3, one caneasily distinguish the scan performed by the benchmarkover the buffers. Each scan shows up as bands of yellowmoving left-to-right. Because a full pass of stores followedby a full pass of loads is performed in each scan, two yellowbands are visible in each scan. By looking at the slope of thesebands, it is possible to understand the rate of progress of thebenchmark moving through the buffer. It can be noted thatstores are slower than loads — it takes around 8 snapshotsto complete the first store sub-iteration, and only 5 for theloads. This might sound counter-intuitive but indeed makessense because (1) in write-back caches a store might result TABLE 2Pollution and Time overhead of CacheFlow.
Space Overhead (%) Time Overhead ( Cycles)Mode Avg. Std. Dev Max Avg. Std. Dev. Max
Full Flush
Resolve+Layout
Resolve
Layout
Full Transparent in a write-back of a dirty line followed by a load frommain memory; and (2) a cache with many outstanding storetransactions will stall when its internal write-buffer is full.Because Figure 3 was produced in flush mode, the onlycache lines highlighted in the heat-map are those that wereaccessed by the application in-between snapshots. It goesto demonstrate that flush mode is particularly well suitedto study the active cache set of applications. Conversely,operating in transparent mode allows us to understandhow/if cache lines allocated at some point during execu-tion persist in cache. Figure 4 depicts one such case. Aftersnapshot 12, the application will not access the top portionof its addressing space. Nonetheless, since the overall buffertouched by the application is smaller than the cache size,the unused lines remain in cache. These lines slightly fadein color because some eviction still occurs due to the randomreplacement policy of this cache.
We evaluate the overhead introduced by CacheFlow alongtwo dimensions: cache pollution and timing. We measurecache pollution as the change in cache content that is intro-duced solely because of CacheFlow’s activity. To evaluatecache pollution, we first execute a cache-intensive applica-tion for a given lead time set to 100 ms. The lead time allowsthe application to populate the cache. Then, we set theTrigger to activate at the end of the lead time and to acquiretwo consecutive snapshots back-to-back. The first snapshotcaptures the content of the cache status as populated by theapplication under analysis. The second snapshot is used tounderstand the change in cache status introduced by kernel-to-user copy of the first snapshot and its post-processing.The overhead then is evaluated in terms of percentage ofcache lines that have changed over the total number of cachelines. The overhead in time is evaluated by measuring theend-to-end time required to acquire a single snapshot.The results of the overhead measurements are reportedin Table 2 and are the aggregation of 100 experimentsfor each setup. We measure pollution and time overheadin synchronous mode because it is the most suitable forgeneral analysis scenarios. We then consider a host of dif-ferent sub-modes. In “
Full Flush ”, CacheFlow operates inflush mode with physical → virtual translation and VMAlayout acquisition active. As expected, around 95% of thecache content is modified in this mode after a snapshotis completed. The next four cases correspond to the over-head of CacheFlow operating in transparent mode. In the“ Resolve+Layout ” mode, both address resolution and VMAlayout acquisition are turned on. In the “
Resolve ” (resp.,“
Layout ”) case, only address resolution (resp., VMA layoutacquisition) are enabled. Finally in the “Full Transparent” +1+2+3+4libc +0+2+4[text] +0+46+92+138+184+230+276[heap] 0 9 18 27 36 45 54 63 72 81 90 99 108 117 126 135 144 153 162 171 180 189Snapshot Fig. 5. D
ISPARITY memory heat-map when executed on VGA input. +30+31+32[stack] +0+3+6[text] +0+450+900+1350+1800+2250+2700[anon] 0 10 20 30 40 50 60 70 80 90 100 110 120 130 140 150 160 170 180 190Snapshot
Fig. 6. S
IFT memory heat-map when executed on VGA input. case, both address resolution and VMA layout acquisitionare skipped. Notably, cache pollution in full transparentmode does not exceed 1.4%, which is acceptable for asoftware-only solution. Times are reported in millions ofCPU cycles to better generalize to other platforms. TheCPUs operate at 1.9 GHz on the target platform.
Having assessed that CacheFlow can indeed provide ac-curate insights into an application’s cache utilization, weanalyze two real applications. Figure 5 and Figure 6 reportthe heat-map analysis of SD-VBS benchmarks D
ISPARITY and S
IFT , respectively, using VGA resolution images ininput.It can be observed that D
ISPARITY is characterized byquite distinguishable intro (snap. 0-25) and outro phases(snap. 172-198). In the intro, the input buffer is first pre-processed in the “anon” region. It follows an intermediateprocessing phase that actively uses around 83% of theregion’s memory with a recurring pattern and pages withoffset 466-930 being the most frequently accessed. Duringthe outro phase the final output is produced sequentiallyon the heap. A quite different picture is painted by the S
IFT application. In this case, there exists an initial phase (snap. 0-40) where pre-processing is performed on an anonymous re-gion; then the bulk of processing is carried out on the heap.From around snapshot 30 until 170, two non-contiguous setsof memory pages are in use, that gradually span through82% of the region. The final computation step is performedon the top 75% of the heap.
CacheFlow can also provide valuable insights on the cache-related behavior of the whole system when multiple ap-plications are executed. To evaluate this, we concurrently L C a c h e O cc u p a n c y parent track. (vga) mser (vga) sift (cif) disp. (cif) other unres Fig. 7. Single-core execution of 4 SD-VBS benchmarks with defaultcompletely fair scheduler (Linux’s default). L C a c h e O cc u p a n c y parent track. (vga) mser (vga) sift (cif) disp. (cif) other unres Fig. 8. Single-core execution of 4 SD-VBS benchmarks with fixed-priorityreal-time scheduler. run four SD-VBS applications, namely T
RACK , M
SER , S
IFT ,D ISPARITY , all with VGA resolution images in input.In the first experiment, the benchmarks are executed on asingle core — the other three cores are turned off. Moreover,the default Linux’s scheduling policy (
SCHED_OTHER ) isused. We then set the Trigger to acquire periodic snapshotsevery 10 ms and study the per-process L2 occupancy overtime. The results are shown in Figure 7. The
SCHED_OTHER scheduler implements a Completely Fair Scheduler (CFS),which tries to ensure an even progress of co-running work-load. This results in frequent context-switches that result inquick changes in the composition of the L2 cache contentvisible in the figure. The figure also shows how more cache-intensive benchmarks such as D
ISPARITY can dominateother workload in terms of utilization of cache resources.To understand the impact of scheduling policies oncache utilization, we conduct a similar experiment wherewe use a fixed-priority real-time scheduler. Specifically, werun the benchmarks with
SCHED_FIFO policy and set theirpriorities to be in decreasing order — i.e. T
RACK has highestpriority, D
ISPARITY has lowest priority. With this setup, weobtain Figure 8. In this case, it is clear that the executionof the benchmarks is performed in strict order with nointerleaving. It can also be noted how the benchmarks underanalysis vary their WSS as they progress. Another interest-ing observation is that overall the execution takes less time(640 snapshots) compared to Figure 7, likely due to bettercache locality in absence of frequent context switching.Next, we demonstrate that CacheFlow can also be usedto investigate the behavior of the shared L2 cache even withmultiple cores being active. In this setup, we turn on 3 out of L C a c h e O cc u p a n c y parent track. (vga) mser (vga) sift (cif) disp. (cif) other unres Fig. 9. Execution of 4 SD-VBS benchmarks with fixed-priority real-timescheduler on 3 cores. and deploy the same set of 4 SD-VBS benchmarkswith the arrangement of real-time priorities. We do not con-trol process-to-CPU assignment and let the Linux schedulerallocate processes to cores at runtime. What emerges interms of per-process L2 occupancy is depicted in Figure 9.Given that L2 is a shared cache, we see multiple applicationscompeting for cache space. As they progress, the amount ofoccupied cache depends not only on their current WSS, butalso on the WSS of co-running applications. This is particu-larly clear when looking at the interplay in cache betweenthe T RACK and M
SER benchmarks compared to what weobserved in Figure 8. Because only 3 cores are available,it can also be noted that D
ISPARITY only starts executingat snapshot 68, once again significantly dominating cacheutilization in the central portion of its execution (snap. 98-215).
In this section, we investigate if cache snapshotting canbe used to predict clash on cache resources between co-running applications. To conduct this analysis, we firstanalyze the behavior of our applications in isolation. We usesnapshotting to derive two type of profiles on the SD-VBSapplications under analysis. Both profiles are constructedfrom snapshots acquired in flush mode. In the first profile,we study the sheer size of data allocated in cache at eachsnapshot — “Active” set analysis. In the second type ofprofile, namely “Reused” set analysis, we evaluate onlythe number of lines that are the same and present in twosuccessive snapshots. The reused set captures the amountof data that, if evicted, causes a penalty in execution time.While active set analysis could be carried out by carefullyinterpreting cache performance counters, reused set analysiscan only be conducted by leveraging the exact knowledgeof what lines are cached from time to time. Hence, this typeof analysis was previously limited to simulation studies.Figure 10 depicts the two profiles for each of the consid-ered real benchmarks. In addition, we considered a syntheticbenchmark called B
OMB which continually and sequentiallyaccesses a 2.5 MB buffer. From the figure, it appears thatwhile there exists a positive correlation between the activeand reuse set, the two often (and substantially) differ.
10. We also conducted the experiment on the full 4-core setup, whichyielded similar results and is omitted due to space constraints. L li n e s ( % ) disparity - Active disparity - Reused L li n e s ( % ) mser - Active mser - Reused L li n e s ( % ) sift - Active sift - Reused L li n e s ( % ) tracking - Active tracking - Reused L li n e s ( % ) bomb - Active bomb - Reused Fig. 10. Active set (left) and reused set (right) analysis of SD-VBS andB
OMB benchmarks.
Using the profiles, we build two metrics to try andpredict the impact that two applications running in parallelwould have on each other. For this experiment, we establishthe ground truth by observing the slowdown suffered by anapplication under analysis when running in parallel with aninterfering application. In these measurements, CacheFlowis not used. The results are reported in Table 3, in the firstgroup of rows.The first metric, namely “Active Set Excess” is based onactive sets only. In this case, we consider two applicationsat a time: an observed, and an interfering one. We thenconsider the sum of their active sets on a per-snapshotbasis, and compute how much that sum exceeds the sizeof the cache (e.g. by 0%, by 50% etc.). We then averagethis quantity over the full length of the profile. The resultsobtained from applying this metric on all the considered SD-VBS applications are reported in Table 3 — second group ofrows.A second metric, namely “Reused Set Eviction”, consid-ers how much of the reused set of the application underanalysis is potentially evicted by an interfering application.To build this metric, we once again reason on a per-snapshotbasis. We multiply the reused set quota of the applicationunder analysis by the active set quota of the interferingapplication. The average over the full length of the profile isthen computed. The third group of rows in Table 3 reportsthe reused set eviction metric computed for the consideredbenchmarks.Finally, we computed the correlation between the twometrics described above and the ground truth on the mea-sured slowdown. The reused set eviction metric revealed an80% correlation with slowdown. It outperformed the activeset excess metric which achieved 74% correlation. Theseresults serve as a proof-of-concept that CacheFlow can beused, for instance, to perform interference-aware scheduling TABLE 3Correlation of Slowdown and Cache Activity-Based Indices B ENCHMARK U NDER A NALYSIS I NTERF .A PPLIC . D ISPARITY M SER S IFT T RACK D ISPARITY
SER
IFT
RACK × ) B OMB
ISPARITY
SER
IFT
RACK
OMB
ORRELATION : D ISPARITY
SER
IFT
RACK
OMB
ORRELATION : decisions. The last aspect we evaluated was the capability intro-duced by CacheFlow to evaluate the replacement policyimplemented by the hardware. We focus on two aspects.First, we validate that the policy indeed follows randomreplacement. Second, we study how well the implementedreplacement policy matches a truly random replacement. Toconduct these experiments we devised a special syntheticbenchmark, namely R
EPL . The R
EPL benchmark allocatesfrom user-space a set of contiguous physical pages with atotal size equal to the L2 cache size. This is done by mmap leveraging support for huge pages —
MAP_HUGE_2MB flag.Because the allocated buffer is aligned with the cachesize, the first line, say Line A , of the buffer necessarily mapsto cache set 0. Similarly, the line (Line A ) that is exactly128 KB away (the size of one cache way) from Line A also maps to set 0. With a similar reasoning we identifya set of 16 lines { A , . . . , A } that map to set 0. If thecache implements a deterministic cache replacement policy(e.g. LRU or FIFO), then accessing lines A through A will result in a snapshot containing all the lines. Followingthis idea, the R EPL benchmark touches lines A through A a configurable number of times (iterations). Then, itdelivers a signal to the Trigger (event-based activation) toacquire a snapshot in transparent mode. In the snapshot,we evaluate how many of the 16 lines are actually presentin cache. By repeating the same experiment 1000 times, wecan plot the probability density that k ∈ { , . . . , } out of16 lines are found in cache. We repeat the experiment byperforming from 1 to 8 iterations over lines A through A before acquiring a snapshot. The resulting density plots arereported in Figure 11.In our last experiment, we evaluate how closely theimplemented random replacement policy matches a trulyrandom replacement policy. To evaluate this aspect, we usea variant of the R EPL benchmark. We consider again lines A through A . But in this case, the benchmark activates P r o b a b ili t y k )0.10.30.5 1 iter.2 iter. 3 iter.4 iter. 5 iter.6 iter. 7 iter.8 iter. Fit Fig. 11. Probability of k lines on the same cache set being cached afterhaving been accessed 1 (top) through 8 (bottom) times. P r o b a b ili t y Ideal Replacement Observed Replacement
Fig. 12. Probability of cache way 1-16 being selected for replacement. the Trigger after touching each line and stops after reachingline A . A single run produces 16 snapshots. Moreover insnapshot 1, we can derive which cache way was selectedto allocate A , in snapshot 2 what way was selected for A and so on. We run the experiment 2000 times and collect atotal of 32,000 replacement decisions. We then compute thenumber of times each of the 16 cache ways were selected forallocation over the total number of observations. The resultsare reported in Figure 12. In a perfect random replacementscheme, each way has probability = 6 . of beingselected. The implemented replacement policy does notdeviate significantly from a perfect replacement, althoughinterestingly, it appears that the central ways (ways 7 to 11)are statistically less likely to be selected for allocation. ONCLUSION AND F UTURE W ORK
In this work, we have proposed a technique, namelyCacheFlow, that leverages existing yet untapped micro-architectural support to enable cache content snap-shotting. The proposed implementation has highlightedthat CacheFlow can provide unprecedented insights onapplication-level features in the usage of cache resources,and on system-level properties. As such, we envision that CacheFlow and its analogues will serve as a powerfulinstrument for system designers to better understand theinterplay between applications, system components, andcache hierarchy. Ultimately, we expect that it will com-plement existing simulation-based and static analysis ap-proaches by providing a way to refine and validate cachememory models. While we have restricted ourselves toanalyzing the LLC in this work,
RAMINDEX ’s capabilitiesfar exceed that. Hence, we encourage the community toextend and refine the proposed open-source implementationto conduct a wider range of studies, e.g. on the behavior ofprivate caches, TLBs, and coherence controllers. R EFERENCES [1] M. Lv, N. Guan, J. Reineke, R. Wilhelm, and W. Yi, “A survey onstatic cache analysis for real-time systems,”
Leibniz Transactions onEmbedded Systems , vol. 3, no. 1, pp. 05–1, 2016.[2] R. Wilhelm, J. Engblom, A. Ermedahl, N. Holsti, S. Thesing,D. Whalley, G. Bernat, C. Ferdinand, R. Heckmann, T. Mitra et al. ,“The worst-case execution-time problemoverview of methods andsurvey of tools,”
ACM Transactions on Embedded Computing Systems(TECS) , vol. 7, no. 3, pp. 1–53, 2008.[3] D. Grund,
Static Cache Analysis for Real-Time Systems: LRU, FIFO,PLRU . epubli, 2012.[4] C. Ferdinand and R. Wilhelm, “Efficient and precise cache behav-ior prediction for real-time systems,”
Real-Time Systems , vol. 17,no. 2-3, pp. 131–181, 1999.[5] P. Cousot and R. Cousot, “Basic concepts of abstract interpreta-tion,” in
Building the Information Society . Springer, 2004, pp. 359–366.[6] R. T. White, F. Mueller, C. A. Healy, D. B. Whalley, and M. G. Har-mon, “Timing analysis for data caches and set-associative caches,”in
Proceedings Third IEEE Real-Time Technology and ApplicationsSymposium . IEEE, 1997, pp. 192–202.[7] J. Reineke, D. Grund, C. Berg, and R. Wilhelm, “Timing pre-dictability of cache replacement policies,”
Real-Time Systems ,vol. 37, no. 2, pp. 99–122, 2007.[8] N. Guan, X. Yang, M. Lv, and W. Yi, “FIFO cache analysis for wcetestimation: A quantitative approach,” in . IEEE, 2013, pp.296–301.[9] D. Grund and J. Reineke, “Precise and efficient fifo-replacementanalysis based on static phase detection,” in . IEEE, 2010, pp. 155–164.[10] D.-H. Chu and J. Jaffar, “Symbolic simulation on complicatedloops for wcet path analysis,” in . IEEE, 2011, pp. 319–328.[11] D. Chu, J. Jaffar, and R. Maghareh, “Precise cache timing analysisvia symbolic execution,” in , 2016, pp. 1–12.[12] N. Binkert, B. Beckmann, G. Black, S. K. Reinhardt, A. Saidi,A. Basu, J. Hestness, D. R. Hower, T. Krishna, S. Sardashti,R. Sen, K. Sewell, M. Shoaib, N. Vaish, M. D. Hill, andD. A. Wood, “The gem5 simulator,”
SIGARCH Comput. Archit.News , vol. 39, no. 2, p. 17, Aug. 2011. [Online]. Available:https://doi.org/10.1145/2024716.2024718[13] S. R. Sarangi, R. Kalayappan, P. Kallurkar, S. Goel, and E. Peter,“Tejas: A java based versatile micro-architectural simulator,” in , 2015, pp. 47–54.[14] J. Wang, J. Beu, R. Bheda, T. Conte, Z. Dong, C. Kersey,M. Rasquinha, G. Riley, W. Song, H. Xiao, P. Xu, and S. Yalaman-chili, “Manifold: A parallel simulation framework for multicoresystems,” in , 2014, pp. 106–115.[15] N. Nethercote and J. Seward, “Valgrind: A program supervisionframework,”
Electronic notes in theoretical computer science , vol. 89,no. 2, pp. 44–66, 2003.[16] Z. Wang and J. Henkel, “Fast and accurate cache modeling insource-level simulation of embedded software,” in . IEEE,2013, pp. 587–592. [17] L. M. N. Coutinho, J. L. D. Mendes, and C. A. P. S. Martins,“Mscsim -multilevel and split cache simulator,” in
Proceedings.Frontiers in Education. 36th Annual Conference , Oct 2006, pp. 7–12.[18] S. E. Arda, A. NK, A. A. Goksoy, N. Kumbhare, J. Mack, A. L.Sartor, . A. Akoglu, R. Marculescu, and U. Y. Ogras, “Ds3: Asystem-level domain-specific system-on-chip simulation frame-work,” 2020.[19] H. Brais, R. Kalayappan, and P. R. Panda, “A survey of cachesimulators,”
ACM Comput. Surv. , vol. 53, no. 1, Feb. 2020.[20] N. Nethercote and J. Seward, “Valgrind: A framework for heavy-weight dynamic binary instrumentation,” in
Proceedings of the 28thACM SIGPLAN Conference on Programming Language Design andImplementation , ser. PLDI 07. New York, NY, USA: Associationfor Computing Machinery, 2007, p. 89100.[21] D. Levinthal, “Performance analysis guide for intel core i7 pro-cessor and intel xeon 5500 processors,”
Intel Performance AnalysisGuide , vol. 30, p. 18, 2009.[22] J. Treibig, G. Hager, and G. Wellein, “Performance patterns andhardware metrics on modern multicore processors: Best practicesfor performance engineering,” in
European Conference on ParallelProcessing . Springer, 2012, pp. 451–460.[23] D. Molka, R. Sch¨one, D. Hackenberg, and W. E. Nagel, “Detectingmemory-boundedness with hardware performance counters,” in
Proceedings of the 8th ACM/SPEC on International Conference onPerformance Engineering , ser. ICPE 17. New York, NY, USA:Association for Computing Machinery, 2017, p. 2738.[24] P. J. Mucci, S. Browne, C. Deane, and G. Ho, “PAPI: A portableinterface to hardware performance counters,” in
Proceedings of thedepartment of defense HPCMP users group conference , vol. 710, 1999.[25] S. Browne, J. Dongarra, N. Garner, G. Ho, and P. Mucci, “Aportable programming interface for performance evaluation onmodern processors,”
The international journal of high performancecomputing applications , vol. 14, no. 3, pp. 189–204, 2000.[26] J. Treibig, G. Hager, and G. Wellein, “Likwid: A lightweightperformance-oriented tool suite for x86 multicore environments,”in . IEEE, 2010, pp. 207–216.[27] M. Selva, L. Morel, and K. Marquet, “numap: A portable libraryfor low-level memory profiling,” in . IEEE, 2016, pp. 55–62.[28] A. Vishnoi, P. R. Panda, and M. Balakrishnan, “Online cache statedumping for processor debug,” in
Proceedings of the 46th AnnualDesign Automation Conference , 2009, pp. 358–363.[29] B. R. Buck and J. K. Hollingsworth, “A new hardware monitordesign to measure data structure-specific cache eviction infor-mation,”
The International Journal of High Performance ComputingApplications , vol. 20, no. 3, pp. 353–363, 2006.[30] P. R. Panda, A. Vishnoi, and M. Balakrishnan, “Enhancing post-silicon processor debug with incremental cache state dumping,”in . IEEE, 2010, pp. 55–60.[31] ARM Holdings, “Cortex-A53 MPCore technical reference manual(r0p4),” 2018.[32] ——, “Cortex-A57 MPCore technical reference manual (r1p3),”2016.[33] ——, “Cortex-A72 MPCore technical reference manual (r0p2),”2016.[34] ——, “Cortex-A15 technical reference manual (r2p0),” 2011.[35] ——, “Cortex-A9 technical reference manual (r4p0),” 2009.[36] J. Corbet, J. Edge, and R. Sobol, “Kernel Development,” LinuxWeekly News – https://lwn.net/Articles/74295/, 2004, [Online;accessed 7-May-2019].[37] ARM Holdings, “CortexA76 Core technical reference manual(r3p0),” 2018.[38] ——, “CortexA77 Core technical reference manual (r1p1),” 2019.[39] Nvidia Corporation, “NVIDIA Tegra X1 Technical Refer-ence Manual,” https://developer.nvidia.com/embedded/tegra-2-reference.[40] S. K. Venkata, I. Ahn, D. Jeon, A. Gupta, C. Louie, S. Garcia,S. Belongie, and M. B. Taylor, “SD-VBS: The san diego visionbenchmark suite,” in2009 IEEE International Symposium on Work-load Characterization (IISWC)