On the Applicability of PEBS based Online Memory Access Tracking for Heterogeneous Memory Management at Scale

Aleix Roca Nonell, Balazs Gerofi‡, Leonardo Bautista-Gomez, Dominique Martinet†, Vicenç Beltran Querol, Yutaka Ishikawa‡

Barcelona Supercomputing Center, Spain; †CEA, France; ‡RIKEN Center for Computational Science, Japan
{aleix.rocanonell,leonardo.bautista,vbeltran}@bsc.es, [email protected], {bgerofi,yutaka.ishikawa}@riken.jp
ABSTRACT
Operating systems have historically had to manage only a single type of memory device. The imminent availability of heterogeneous memory devices based on emerging memory technologies confronts the classic single memory model and opens a new spectrum of possibilities for memory management. Transparent data movement between different memory devices based on access patterns of applications is a desired feature to make optimal use of such devices and to hide the complexity of memory management from the end user. However, capturing memory access patterns of an application at runtime comes at a cost, which is particularly challenging for large-scale parallel applications that may be sensitive to system noise. In this work, we focus on the access pattern profiling phase prior to the actual memory relocation. We study the feasibility of using Intel's Processor Event-Based Sampling (PEBS) feature to record memory accesses by sampling at runtime and study the overhead at scale. We have implemented a custom PEBS driver in the IHK/McKernel lightweight multi-kernel operating system, one of whose advantages is minimal system interference due to the lightweight kernel's simple design compared to other OS kernels such as Linux. We present the PEBS overhead of a set of scientific applications and show the access patterns identified in noise sensitive HPC applications. Our results show that clear access patterns can be captured with a 10% overhead in the worst case and 1% in the best case when running on up to 128k CPU cores (2,048 Intel Xeon Phi Knights Landing nodes). We conclude that online memory access profiling using PEBS at large scale is promising for memory management in heterogeneous memory environments.
CCS CONCEPTS
• Software and its engineering → Operating systems

KEYWORDS
high-performance computing, operating systems, heterogeneous memory
ACM Reference Format:
Aleix Roca Nonell, Balazs Gerofi, Leonardo Bautista-Gomez, Dominique Martinet, Vicenç Beltran Querol, and Yutaka Ishikawa. 2020. On the Applicability of PEBS based Online Memory Access Tracking for Heterogeneous Memory Management at Scale. In Proceedings of ACM Conference (Conference'17).
ACM, New York, NY, USA, 8 pages. https://doi.org/10.1145/3286475.3286477

© 2018 Association for Computing Machinery. This is the author's version of the work. It is posted here for your personal use. Not for redistribution. The definitive Version of Record was published at https://dl.acm.org/citation.cfm?id=3286477, https://doi.org/10.1145/3286475.3286477
1 INTRODUCTION

The past decade has brought an explosion of new memory technologies. Various high-bandwidth memory types, e.g., 3D stacked DRAM (HBM), GDDR and multi-channel DRAM (MCDRAM), as well as byte addressable non-volatile storage class memories (SCM), e.g., phase-change memory (PCM), resistive RAM (ReRAM) and the recent 3D XPoint, are already in production or expected to become available in the near future.

Management of such heterogeneous memory types is a major challenge for application developers, not only in terms of placing data structures into the most suitable memory but also in terms of adaptively moving content as application characteristics change over time. Operating system and/or runtime level solutions that optimize memory allocations and data movement by transparently mapping application behavior to the underlying hardware are thus highly desired. One of the basic requirements of a system level solution is the ability to track the application's memory access patterns in real time with low overhead. However, existing solutions for access pattern tracking are often based on dynamic instrumentation, which has prohibitive overhead for an online approach [16]. Consequently, system level techniques targeting heterogeneous memory management typically rely on a two-phase model, where the application is profiled first, based on which the suggested allocation policy is then determined [5, 19].

Intel's Processor Event-Based Sampling (PEBS) [3] is an extension to hardware performance counters that enables sampling the internal execution state of the CPU (including the most recent virtual address accessed) and periodically storing a snapshot of it into main memory. The overhead of PEBS has been the focus of previous works [1, 15], however, not in the context of large-scale high-performance computing (HPC).

The hardware PEBS support provides a number of configuration knobs that control how often PEBS records are stored and how often the CPU is interrupted for additional background data processing. Because such disruption typically degrades performance at scale [6, 12], it is important to characterize and understand this overhead to assess PEBS' applicability for heterogeneous memory management in large-scale HPC. Indeed, none of the previous studies focusing on PEBS' overhead we are aware of have addressed large-scale environments.

We have implemented a custom PEBS driver in the IHK/McKernel lightweight multi-kernel operating system [8, 9]. Our motivation for a lightweight kernel (LWK) is threefold. First, lightweight kernels are known to be highly noise-free and thus they provide an excellent environment for characterizing PEBS' overhead. Second, McKernel has a relatively simple code base that enables us to rapidly prototype kernel level features for heterogeneous memory management and allows direct integration with our PEBS driver. Our custom driver can be easily configured and enables fine-grained tuning of parameters that are otherwise not available in the Linux driver (see Section 3 for more details). Finally, the Linux PEBS driver was not available on the platform we used in this study, i.e., the Oakforest-PACS machine [13] based on Intel's Xeon Phi Knights Landing chip.

As the baseline for OS level hierarchical memory management, we aimed at answering the following questions. What is the overhead of real-time memory access tracking at scale? What is the trade-off between sampling granularity and the introduced overhead?
Is it feasible to rely on PEBS for collecting such information online? Specifically, in this paper we make the following contributions:

• An implementation of a custom PEBS driver in an LWK with the ability to fine-tune its parameters
• A systematic evaluation of PEBS' overhead on a number of real HPC applications running at large scale
• A demonstration of captured memory access patterns as a function of different PEBS parameters

Previous studies have reported PEBS failing to provide increased accuracy with reset values (see Section 2.1) lower than 1024 [1, 15], as well as the Linux kernel becoming unstable when performing PEBS based sampling at high frequency [18]. On up to 128k CPU cores (2,048 Xeon Phi KNL nodes), we find that our custom driver captures increasingly accurate access patterns reliably even with very low reset values. Across all of our workloads, PEBS incurs an overhead of 2.3% on average, with approximately 10% and 1% in the worst and best cases, respectively.

The rest of this paper is organized as follows. We begin by explaining the background and motivation in Section 2. We describe the design and implementation of our custom PEBS driver in Section 3. Our large-scale evaluation is provided in Section 4. Section 5 discusses related work, and finally, Section 6 concludes the paper.
2 BACKGROUND AND MOTIVATION

This section lays the groundwork for the proposed driver architecture by providing background information on Intel's Processor Event-Based Sampling facility [3] and the IHK/McKernel lightweight multi-kernel OS [7–9].
2.1 Processor Event-Based Sampling

Processor Event-Based Sampling (PEBS) is a feature of some Intel microarchitectures that builds on top of Intel's Performance Counter Monitor (PCM).

The PCM facility allows monitoring a number of predefined processor performance parameters (hereinafter called "events") by counting the number of occurrences of the specified events in a set of dedicated hardware registers. The exact availability of events depends on the processor's microarchitecture; however, a small set of "architectural performance events" remains consistent starting from the Intel Core Solo and Intel Core Duo generation. When a PCM counter overflows, an interrupt is triggered, which eases the process of sampling. PEBS extends the idea of PCM by transparently storing additional processor information while monitoring a PCM event. However, only a small subset of the PCM events actually supports PEBS. A "PEBS record" is stored by the CPU in a user-defined memory buffer when a configurable number of PCM events, named the "PEBS reset counter value" or simply "reset", occur. The actual PEBS record format is microarchitecture dependent, but it generally includes the set of general-purpose registers.

A "PEBS assist", in Intel nomenclature, is the action of storing the PEBS record into the CPU buffer. When the record written by the last PEBS assist reaches a configurable threshold inside the CPU PEBS buffer, an interrupt is triggered. The interrupt handler should process the PEBS data and clear the buffer, allowing the CPU to continue storing more records. The PCM's overflow interrupt remains inactive while a PCM event is being used with PEBS.
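To make the relation between the reset value, the record buffer and the interrupt threshold more concrete, the following C sketch outlines how a per-CPU PEBS buffer is typically described to the hardware through the Debug Store (DS) save area documented in the Intel SDM [3]. This is a simplified illustration, not the actual McKernel driver code; the helper function and its parameter names are ours, and error handling is omitted.

```c
#include <stdint.h>
#include <stddef.h>

/* Simplified 64-bit Debug Store (DS) save area layout (see the Intel SDM [3]);
 * only the PEBS-related fields matter here. */
struct ds_area {
    uint64_t bts_base, bts_index, bts_max, bts_threshold;  /* BTS fields, unused here */
    uint64_t pebs_base;        /* first byte of the per-CPU PEBS record buffer        */
    uint64_t pebs_index;       /* next free slot; advanced by each PEBS assist        */
    uint64_t pebs_max;         /* one past the last usable byte of the buffer         */
    uint64_t pebs_threshold;   /* when pebs_index reaches this, an interrupt fires    */
    uint64_t pebs_reset[4];    /* counter reload values: one record per 'reset' events */
};

/* Hypothetical helper: program a DS area so that an interrupt is raised after
 * 'records_per_irq' PEBS assists, sampling one record every 'reset'
 * occurrences of the chosen PCM event. */
static void pebs_setup(struct ds_area *ds, void *buf, size_t buf_size,
                       size_t record_size, size_t records_per_irq,
                       uint64_t reset)
{
    ds->pebs_base      = (uint64_t)buf;
    ds->pebs_index     = ds->pebs_base;
    ds->pebs_max       = ds->pebs_base + (buf_size / record_size) * record_size;
    ds->pebs_threshold = ds->pebs_base + records_per_irq * record_size;
    ds->pebs_reset[0]  = (uint64_t)(-(int64_t)reset); /* counters count up to overflow */
    /* The DS area address is then written to the IA32_DS_AREA MSR and PEBS is
     * enabled for the selected counter via IA32_PEBS_ENABLE / IA32_PERFEVTSELx. */
}
```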
Figure 1: Overview of the IHK/McKernel architecture.
2.2 IHK/McKernel

Lightweight multi-kernels emerged recently as a new operating system architecture for HPC, where the basic idea is to run Linux and an LWK side-by-side in compute nodes to attain the scalability properties of LWKs and full compatibility with Linux at the same time. IHK/McKernel is a multi-kernel OS developed at RIKEN, whose architecture is depicted in Figure 1. A low-level software infrastructure, called Interface for Heterogeneous Kernels (IHK) [21], provides capabilities for partitioning resources in a many-core environment (e.g., CPU cores and physical memory) and enables management of lightweight kernels. IHK is capable of allocating and releasing host resources dynamically, and no reboot of the host machine is required when altering its configuration. The latest version of IHK is implemented as a collection of Linux kernel modules without any modifications to the Linux kernel itself, which enables relatively straightforward deployment of the multi-kernel stack on a wide range of Linux distributions. Besides resource and LWK management, IHK also facilitates an Inter-kernel Communication (IKC) layer.

McKernel is a lightweight co-kernel developed on top of IHK. It is designed explicitly for HPC workloads, but it retains a Linux compatible application binary interface (ABI) so that it can execute unmodified Linux binaries. There is no need for recompiling applications or for any McKernel specific libraries. McKernel implements only a small set of performance sensitive system calls and the rest of the OS services are delegated to Linux. Specifically, McKernel provides its own memory management, it supports processes and multi-threading, it has a simple round-robin cooperative (tick-less) scheduler, and it implements standard POSIX signaling. It also implements inter-process memory mappings and it offers interfaces for accessing hardware performance counters.

For more information on system call offloading, refer to [8]; a detailed description of the device driver support is provided in [9]. Recently we have demonstrated that lightweight multi-kernels can indeed outperform Linux on various HPC mini-applications when evaluated on up to 2,048 Intel Xeon Phi nodes interconnected by Intel's Omni-Path network [7]. As mentioned earlier, with respect to this study, one of the major advantages of a multi-kernel LWK is the lightweight kernel's simple codebase that enables us to easily prototype new kernel level features.

3 DESIGN AND IMPLEMENTATION

This section describes the design and implementation of the McKernel PEBS driver. Figure 2 shows a summary of the entire PEBS record lifecycle.
Figure 2: Memory address acquisition process using Intel's PEBS facility in IHK/McKernel
McKernel uses PEBS as a vehicle to keep track of memory addresses issued by each monitored system thread. Ideally, McKernel would keep track of all load and store instructions. However, this is not supported by all Intel microarchitectures. In particular, our test environment powered by the Intel Knights Landing processor only supports recording the address of load instructions that triggered some particular event. PEBS records are always associated with a PCM event. The most general KNL PCM events that support load address recording are L2_HIT_LOADS and L2_MISS_LOADS, which account for L2 hits and L2 misses, respectively.

Both the count of L2 misses and of L2 hits within a page boundary for a given time frame can be used as a metric that determines how likely the page is to be accessed in the future. A page with a high count of either L2 misses or L2 hits reveals that the page is under memory pressure. In the case of misses, we additionally know that the cache is not able to hold the pages long enough for them to be reused. In the case of hits, we know that either the pages are accessed frequently enough to remain in the cache, or simply the whole critical memory range fits into the cache.

In principle, a page with a high L2 miss ratio seems to be a good candidate for being moved into a faster memory device, because missing the L2 on KNL means that data must be serviced from either main memory or the L2 of another core. However, the same page might actually have a higher ratio of L2 hits, indicating that another page with a lower hit ratio might benefit still more from being moved. Consequently, a fair judgment should take both events into consideration. Unfortunately, KNL features a single PCM counter with PEBS support, which means that sampling both events requires dynamic switching at runtime. Nonetheless, the purpose of this work is just a step behind: our objective is to study a single PEBS enabled PCM counter at scale. Therefore, for simplicity, we decided to rely on the L2_MISS_LOADS event to record the load addresses.

McKernel initializes the PEBS CPU data structures at boot time on each CPU. Processes running in McKernel enable PEBS on all the CPUs where their threads are running as soon as they start. As long as the threads are running, PEBS assists write PEBS records into the CPU's buffer transparently, regardless of their execution context (user or kernel space).

The PEBS record format for the Knights Landing architecture consists of (among others) the set of general-purpose registers and the address of the load instruction causing the record dump (PEBS assist), if applicable. In total, 24 64-bit fields are stored, adding up to 192 bytes for each PEBS record. There is no timestamp information stored in each PEBS record, so it is not possible to know exactly when the record took place.

When the remaining PEBS buffer capacity reaches the configured threshold, an interrupt is triggered. The PEBS interrupt handler discards all fields in the PEBS records except the load address and saves the addresses into a per-thread circular buffer. Then, the CPU PEBS buffer is reset, allowing the CPU to continue storing records. Together with the load addresses, a timestamp is saved at the time the interrupt handler is running. This timestamp tags all the PEBS records processed in this interrupt handler execution for posterior analysis.

When each of the application's threads exits, the entire contents of the per-thread buffer is dumped into a file.
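A minimal sketch of this harvesting step is shown below, under our own naming assumptions: the ring-buffer type, the timestamp helper and the index of the address field within a record are placeholders, and the code is an illustration rather than the actual McKernel driver. The handler walks the records accumulated since the last interrupt, keeps only the load addresses, stamps the whole batch once, and rewinds the hardware buffer.

```c
#include <stdint.h>
#include <stddef.h>

#define PEBS_RECORD_SIZE  192   /* 24 x 64-bit fields per record on KNL             */
#define ADDR_FIELD_INDEX   23   /* placeholder: index of the load-address field     */

struct pebs_sample { uint64_t addr, ts; };

struct pebs_ring {               /* per-thread circular buffer (simplified)          */
    struct pebs_sample *slots;
    size_t capacity, head;
};

extern uint64_t read_timestamp(void);   /* hypothetical; e.g., a TSC read            */

/* Harvest the per-CPU PEBS buffer [base, *index): keep only the load addresses,
 * tag the whole batch with a single timestamp, then rewind the buffer. */
static void pebs_harvest(uint64_t base, uint64_t *index, struct pebs_ring *ring)
{
    uint64_t ts = read_timestamp();      /* one timestamp per interrupt, not per record */

    for (uint64_t rec = base; rec < *index; rec += PEBS_RECORD_SIZE) {
        uint64_t addr = ((const uint64_t *)(uintptr_t)rec)[ADDR_FIELD_INDEX];
        if (addr) {                      /* records without a data address are skipped */
            ring->slots[ring->head % ring->capacity] = (struct pebs_sample){ addr, ts };
            ring->head++;
        }
    }
    *index = base;                       /* reset so the CPU can keep storing records  */
}
```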
We have developed a small Python visualization tool to read and generate plots based on the information provided. The registered load addresses might not belong to application-specific user buffers but may come from anywhere in the address space. For offline visualization purposes we are mostly interested in profiling the application's buffers and hence it is convenient to provide some means to filter the undesired addresses. Load addresses can be sparse, and visualizing the entire address space of an application to detect patterns can be difficult. It is important to notice that filtering is not a requirement for online monitoring of highly demanded pages; it is only necessary for visualization.

A simple heuristic to do so is to filter out all addresses belonging to small mappings. To minimize the impact of filtering, the postprocessing is done offline in our visualization script. Hence, McKernel only keeps track of all mappings greater than four megabytes by storing their start addresses, lengths and the timestamp at which the operation completed. All munmap operations are also registered regardless of their size because they might split a bigger tracked area. The mapping information is stored in a per-process buffer, shared by all threads using a lock-free queue. The per-process mappings buffer is also dumped into the PEBS file at each thread's termination time. Our PEBS address viewer loads the file and reconstructs the process's virtual memory mapping history based on the mmap and munmap memory ranges and timestamps. Then, it reads all the registered PEBS load addresses and classifies them into the right spatial and temporal mapping, or discards them if no suitable mapping is found. Finally, individual plots are shown per mapping.

The PEBS data acquisition rate is controlled by the configurable number of events that trigger a PEBS assist and by the size of the CPU PEBS buffer (which indirectly controls the threshold that triggers an interrupt). We have added a simple interface to McKernel to dynamically configure these parameters at application launch time by resizing the CPU buffer and reconfiguring the PEBS MSR registers as requested. This differs from the current Linux kernel driver, in which it is only possible to configure the reset counter value but not the PEBS buffer size.

It would be ideal to have a CPU buffer big enough to hold all load addresses the application generates, to both reduce the memory movements between buffers and suppress the interrupt overhead. However, a low interrupt rate also diffuses the time perception of memory accesses, because timestamps are associated with PEBS records in the interrupt handler. Therefore, this implementation actually requires setting up a proper interrupt rate to understand the evolution of memory accesses in time. Note that instead of relying on the interrupt handler to harvest the PEBS CPU buffer, another option is to dedicate a hardware thread to this task. We plan to implement this option in the near future.
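To illustrate the interaction of the two knobs, the snippet below estimates how many L2 misses elapse between two PEBS interrupts, assuming the 192-byte KNL record size described earlier and assuming the interrupt threshold sits at the very end of the buffer. The buffer size fixes how many records accumulate per interrupt, while the reset value fixes how many misses each record represents. The helper and its names are illustrative, not part of the driver interface.

```c
#include <stdint.h>
#include <stdio.h>

#define PEBS_RECORD_SIZE 192ULL   /* bytes per record on KNL */

/* Rough estimate of L2 misses elapsed between two PEBS interrupts:
 * one record per 'reset' misses, one interrupt per buffer's worth of records. */
static uint64_t misses_per_interrupt(uint64_t buffer_bytes, uint64_t reset)
{
    uint64_t records_per_irq = buffer_bytes / PEBS_RECORD_SIZE;
    return records_per_irq * reset;
}

int main(void)
{
    /* An 8 kB buffer with reset = 64 yields ~42 records and ~2,688 L2 misses per
     * interrupt; a 32 kB buffer with reset = 256 yields ~170 records and ~43,520
     * misses, i.e., roughly 16x fewer interrupts for the same miss rate. */
    printf("%llu\n", (unsigned long long)misses_per_interrupt(8 * 1024, 64));
    printf("%llu\n", (unsigned long long)misses_per_interrupt(32 * 1024, 256));
    return 0;
}
```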
4 EVALUATION

All of our experiments were performed on Oakforest-PACS (OFP), a Fujitsu built, 25 petaflops supercomputer installed at JCAHPC, managed by The University of Tsukuba and The University of Tokyo [13]. OFP is comprised of eight thousand compute nodes that are interconnected by Intel's Omni-Path network. Each node is equipped with an Intel® Xeon Phi™ (Knights Landing) processor and runs a CentOS based distribution. This CentOS distribution contains a number of Intel supplied kernel level improvements specifically targeting the KNL processor that were originally distributed in Intel's XPPSL package. We used Intel MPI Version 2018 Update 1 Build 20171011 (id: 17941) in this study.

For all experiments, we dedicated 64 CPU cores to the applications (i.e., to McKernel) and reserved 4 CPU cores for Linux activities. This is a common scenario for OFP users, where daemons and other system services run on the first four cores even in a Linux only configuration.

We used a number of mini-applications from the CORAL benchmark suite [2] and one developed at The University of Tokyo. Along with a brief description, we also provide information regarding their runtime configuration.

• GeoFEM solves 3D linear elasticity problems in simple cube geometries by a parallel finite-element method [17]. We used weak scaling for GeoFEM and ran 16 MPI ranks per node, where each rank contained 8 OpenMP threads.
• HPCG is the High Performance Conjugate Gradients benchmark, a stand-alone code that measures the performance of basic operations in a unified code for sparse matrix-vector multiplication, vector updates and global dot products [4]. We used weak scaling for HPCG and ran 8 MPI ranks per node, where each rank contained 8 OpenMP threads.
• Lammps is a classical molecular dynamics code, an acronym for Large-scale Atomic/Molecular Massively Parallel Simulator [20]. We used weak scaling for Lammps and ran 32 MPI ranks per node, where each rank contained four OpenMP threads.
• miniFE is a proxy application for unstructured implicit finite element codes [11]. We used strong scaling for miniFE and ran 16 MPI ranks per node, where each rank contained four OpenMP threads.
• Lulesh is the Livermore Unstructured Lagrangian Explicit Shock Hydrodynamics code, originally defined as one of five challenge problems in the DARPA UHPC program [14]. We used weak scaling for Lulesh and ran 8 MPI ranks per node, where each rank contained 16 OpenMP threads.
• AMG2013 is a parallel algebraic multigrid solver for linear systems arising from problems on unstructured grids [10]. We used weak scaling for AMG and ran 16 MPI ranks per node, where each rank contained 16 OpenMP threads.
For each workload described above, we use nine different PEBS configurations. We scale the PEBS reset value from 256, through 128, to 64 and use PEBS per-CPU buffer sizes of 8kB, 16kB and 32kB. As mentioned earlier, the reset value controls the sampling granularity while the PEBS buffer size impacts the PEBS IRQ frequency. We emphasize again that, contrary to previous reports on PEBS' inability to provide increased accuracy with reset values lower than 1024 [1, 15, 18], we find very clear indications that obtaining increasingly accurate samples with lower reset values is possible, for which we provide more information below.

We ran each workload for all configurations scaling from 2,048 to 128k CPU cores, i.e., from 32 to 2,048 compute nodes, respectively. We compare individually the execution time of each benchmark run on McKernel with and without memory access tracking enabled.
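In our notation, the relative overhead reported below is the slowdown of the tracked run with respect to the baseline:

```latex
\mathrm{overhead} = \frac{T_{\mathrm{PEBS}} - T_{\mathrm{baseline}}}{T_{\mathrm{baseline}}}
```

where T_PEBS and T_baseline denote the execution times with and without memory access tracking enabled, respectively.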
Figure 3: PEBS overhead for GeoFEM, HPCG, LAMMPS, Lulesh, MiniFE and AMG on up to 2,048 Xeon Phi KNL nodes
We report the average value of three executions, except for a few long-running experiments, where we took only two samples (e.g., for GeoFEM). Note that all measurements were taken on McKernel and no Linux numbers are provided. For a detailed comparison between Linux and McKernel, refer to [7].

Figure 3 summarizes our application level findings. The X-axis represents node counts while the Y-axis shows the relative overhead compared to the baseline performance.
Figure 4: MiniFE access pattern with different PEBS reset values (8kB PEBS buffer)

Figure 5: Lulesh access pattern with different PEBS reset values (8kB PEBS buffer)

For each bar in the plot, the legend indicates the PEBS reset value and the PEBS buffer size used in the given experiment. The general tendency of overhead for most of the measurements matched our expectations, i.e., the most influential factor in performance overhead is the PEBS reset value, whose impact can be relaxed to some extent by adjusting the PEBS buffer size.

Across all workloads, we observe the largest overhead on GeoFEM (shown in Figure 3a) when running with the lowest PEBS reset value of 64 and the smallest PEBS buffer of 8kB, where the overhead peaked at 10.2%. Nevertheless, even for GeoFEM a less aggressive PEBS configuration, e.g., a reset value of 256 with a 32kB PEBS buffer size, induces only up to 4% overhead.

To much of our surprise, on most workloads PEBS' periodic interruption of the application does not imply additional overhead as we scale out with the number of compute nodes. In fact, on some of the workloads, e.g., HPCG (shown in Figure 3b) and Lammps (shown in Figure 3c), we even observe a slight decrease in overhead, for which we currently have no precise explanation and whose root cause requires further investigation. Note that both of these workloads were weak scaled and thus are presumed to compute on a quasi-constant amount of per-process data irrespective of scale.

One particular application that did experience growing overhead as the scale increased is MiniFE, shown in Figure 3e. MiniFE was the only workload we ran in a strong-scaled configuration and our previous experience with MiniFE indicates that it is indeed sensitive to system noise [7]. Despite the expectation that, due to the decreasing amount of per-process data at larger node counts, the PEBS overhead would gradually diminish, the disruption from constant PEBS interrupts appears to amplify its negative impact.

To demonstrate the impact of PEBS' reset value on the accuracy of memory access tracking, we provide excerpts of memory access patterns using different reset values. We have been able to observe similar memory access patterns for all benchmarks tested, but we present the results for MiniFE and Lulesh as an example. Figure 4 and Figure 5 show the heatmaps of the access patterns captured on 32 nodes for three reset values: 64, 128 and 256. The X-axis represents the sample set ID, i.e., periods of time between PEBS interrupts, while the Y-axis indicates the virtual address of the corresponding memory pages.
Figure 6: Distribution of elapsed time between PEBS interrupts for MiniFE with three different reset values
Figure 7: Access histogram per page for a MiniFE execution

Although PEBS addresses are captured at byte granularity, page size is the minimum unit the OS' memory manager works with. In fact, for better visibility, we show the heatmap with larger unit sizes, i.e., in blocks of 4 pages.

One of the key observations here is the increasingly detailed view of the captured access pattern as we decrease the PEBS reset counter. As seen, halving the reset value from 128 to 64 gives a 2X higher granularity per sample set, e.g., the stride access of MiniFE is stretched with respect to the sample IDs. Note that one iteration over MiniFE's buffer presented in the plot corresponds to approximately 330ms. To put the accuracy in more quantitative terms, out of the 1536 pages of the buffer shown in the figure, PEBS with a reset value of 64 reports 1430 pages touched, while reset values of 128 and 256 report 1157 and 843, respectively. To the contrary, Lulesh's plots indicate that access patterns that do not significantly change in time can also be captured with lower granularity, and thus the reset value should be adjusted dynamically based on the application. Note that the number of compute nodes used affects the amount of memory each node works with and might alter the visible pattern. However, as long as the memory share per core does not fit in the L2, the patterns will generally remain similar.

The implicit effect of altering the PEBS reset counter is an increase or decrease of the PEBS interrupt frequency, assuming a constant workload. The ability to control the interrupt rate should have a clear impact on the expected overhead, at least in noise sensitive applications such as MiniFE. We have presented the relationship between overhead and the PEBS reset counter in Figure 3 and we now show the relationship between the PEBS reset counter and interrupt frequency in Figure 6. The elapsed time between interrupts is shown for three executions of MiniFE with reset values of 64, 128 and 256. As expected, we can see a clear correlation between the average duration and the reset counter value: the former becomes smaller as the latter decreases. We also note that the interrupt handler itself took approximately 20 thousand cycles. It is also interesting to observe the formation of two close peaks per execution. This tendency identifies two different access patterns within the application that lead to different L2 miss generation schemes.

The presence of particularly hot pages can be easily localized by inspecting the histogram of aggregated L2 misses shown in Figure 7. The plot shows on the Y-axis the number of different pages that had N L2 misses, where N is shown on the X-axis. We can easily see that most of the pages in MiniFE had a small number of misses, at the leftmost side of the histogram. However, the plot reveals an important group of pages above 50 L2 misses that could be tagged as movable targets; a sketch of such a selection is given below.
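As a simple illustration of how the recorded samples could feed a placement decision, the following sketch aggregates sampled load addresses falling into one tracked mapping into per-page miss counts and flags pages above a threshold as migration candidates. It is our own illustrative code, not part of the McKernel driver; 4 KiB pages are assumed, and the threshold of 50 misses merely mirrors the cutoff visible in Figure 7.

```c
#include <stdint.h>
#include <stddef.h>

#define PAGE_SHIFT     12    /* assuming 4 KiB pages                     */
#define HOT_THRESHOLD  50    /* cutoff suggested by the histogram above  */

/* Aggregate sampled load addresses falling into the tracked mapping
 * [map_start, map_start + map_len) into per-page miss counts.
 * 'counts' must hold (map_len >> PAGE_SHIFT) zero-initialized entries. */
static void count_misses(const uint64_t *addrs, size_t n,
                         uint64_t map_start, uint64_t map_len,
                         uint32_t *counts)
{
    for (size_t i = 0; i < n; i++) {
        if (addrs[i] < map_start || addrs[i] >= map_start + map_len)
            continue;                    /* address outside the tracked mapping */
        counts[(addrs[i] - map_start) >> PAGE_SHIFT]++;
    }
}

/* Return the addresses of pages whose miss count exceeds the threshold,
 * i.e., candidates for relocation to a faster memory device. */
static size_t select_hot_pages(const uint32_t *counts, size_t npages,
                               uint64_t map_start, uint64_t *out, size_t max_out)
{
    size_t found = 0;
    for (size_t i = 0; i < npages && found < max_out; i++)
        if (counts[i] > HOT_THRESHOLD)
            out[found++] = map_start + ((uint64_t)i << PAGE_SHIFT);
    return found;
}
```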
In summary, we believe that our large-scale results demonstrate PEBS' modest overhead for online memory access tracking, and we think that a PEBS based approach to heterogeneous memory management is worth pursuing.

5 RELATED WORK

This section discusses related studies in the domains of heterogeneous memory management and memory access tracking.

Available tools that help to determine efficient data placement in heterogeneous memory systems typically require developers to run a profiling phase of their application and modify their code accordingly. Dulloor et al. proposed techniques to identify data objects to be placed into DRAM in a hybrid DRAM/NVM configuration [5]. Peng et al. considered the same problem in the context of MCDRAM/DRAM using the Intel Xeon Phi processor [19]. In order to track memory accesses, these tools often rely on dynamic instrumentation (such as PIN [16]), which imposes significant performance overhead that makes it impractical for online access tracking.

Larysch developed a PEBS system to assess the memory bandwidth utilization of applications and reported low overheads, but did not provide a quantitative characterization of using PEBS for this purpose [15]. Akiyama et al. evaluated PEBS overhead on a set of enterprise computing workloads with the aim of finding performance anomalies in high-throughput applications (e.g., Spark, RDBMS) [1]. PEBS has also been utilized to determine data placement in emulated non-volatile memory based heterogeneous systems [22]. None of these works, however, have focused exclusively on studying PEBS overhead in large-scale configurations. To the contrary, we explicitly target large-scale HPC workloads to assess the scalability impacts of PEBS based memory access tracking.

Olson et al. reported in a very recent study that decreasing the PEBS reset value below 128 on Linux caused the system to crash [18]. While they disclosed results only for a single node setup, we demonstrated that our custom PEBS driver in McKernel performs reliably and induces low overheads even when using small PEBS reset values in a large-scale deployment.

6 CONCLUSION AND FUTURE WORK

This paper has presented the design, implementation and evaluation of a PEBS driver for IHK/McKernel, which aims to provide the groundwork for an OS level heterogeneous memory manager. We have shown the captured access patterns of two scientific applications and demonstrated the evolution of their resolution as we change the PEBS profiling parameters. We have analyzed the overhead impact associated with the different recording resolutions, in both the timing and interrupt domains, at scale up to 128k CPUs (or 2,048 compute nodes) for six scientific applications. We observed overheads that depend highly on both the application behavior and the recording parameters, ranging between 1% and 10.2%. However, we have been able to substantially reduce the overhead of our worst-case scenario from 10.2% to 4% by adjusting the recording parameters while still achieving clearly visible access patterns. Our experience contrasts with the current Linux kernel PEBS implementation, which is not capable of achieving very fine-grained sample rates. We conclude that PEBS' efficiency matches the basic requirements for heterogeneous memory management, but further work is necessary to quantify the additional overhead associated with using the recorded data at runtime.

Our immediate future work is to address the challenge of properly using the recorded addresses at runtime to reorganize memory pages across memory devices based on access patterns. We will study the benefits of dedicating a hardware thread to periodically harvest the CPU PEBS buffer instead of relying on interrupts that constantly pause the execution of the user processes.
We also intend to deeply analyze the differences between the IHK/McKernel PEBS driver and the Linux kernel driver to better quantify the observed limitations.
ACKNOWLEDGMENT
This work has been partially funded by MEXT's program for the Development and Improvement of Next Generation Ultra High-Speed Computer Systems under its subsidies for operating the Specific Advanced Large Research Facilities in Japan. This project has received funding from the European Union's Horizon 2020 research and innovation programme under the Marie Sklodowska-Curie grant agreement No 708566 (DURO) and agreement No 754304 (DEEP-EST).
REFERENCES
[1] Soramichi Akiyama and Takahiro Hirofuchi. 2017. Quantitative Evaluation of Intel PEBS Overhead for Online System-Noise Analysis. In Proceedings of the 7th International Workshop on Runtime and Operating Systems for Supercomputers (ROSS '17). ACM, New York, NY, USA, Article 3, 8 pages.
[2] CORAL. 2013. Benchmark Codes. https://asc.llnl.gov/CORAL-benchmarks/. (Nov. 2013).
[3] Intel Corporation. 2018. Intel 64 and IA-32 Architectures Software Developer Manuals. https://software.intel.com/articles/intel-sdm. (2018).
[4] Jack Dongarra, Michael A. Heroux, and Piotr Luszczek. 2015. HPCG Benchmark: A New Metric for Ranking High Performance Computing Systems. Technical Report UT-EECS-15-736. University of Tennessee, Electrical Engineering and Computer Science Department.
[5] Subramanya R. Dulloor, Amitabha Roy, Zheguang Zhao, Narayanan Sundaram, Nadathur Satish, Rajesh Sankaran, Jeff Jackson, and Karsten Schwan. 2016. Data Tiering in Heterogeneous Memory Systems. In Proceedings of the Eleventh European Conference on Computer Systems (EuroSys '16). ACM, New York, NY, USA, Article 15, 16 pages. http://doi.acm.org/10.1145/2901318.2901344
[6] Kurt B. Ferreira, Patrick Bridges, and Ron Brightwell. 2008. Characterizing Application Sensitivity to OS Interference Using Kernel-level Noise Injection. In Proceedings of the 2008 ACM/IEEE Conference on Supercomputing (SC '08). IEEE Press, Piscataway, NJ, USA, Article 19, 12 pages.
[7] Balazs Gerofi, Rolf Riesen, Masamichi Takagi, Taisuke Boku, Yutaka Ishikawa, and Robert W. Wisniewski. 2018 (to appear). Performance and Scalability of Lightweight Multi-Kernel based Operating Systems.
[8] Balazs Gerofi, Akio Shimada, Atsushi Hori, and Yutaka Ishikawa. 2013. Partially Separated Page Tables for Efficient Operating System Assisted Hierarchical Memory Management on Heterogeneous Architectures.
[9] B. Gerofi, M. Takagi, A. Hori, G. Nakamura, T. Shirasawa, and Y. Ishikawa. 2016. On the Scalability, Performance Isolation and Device Driver Transparency of the IHK/McKernel Hybrid Lightweight Kernel. 1041–1050.
[10] V. E. Henson and U. M. Yang. 2002. BoomerAMG: A Parallel Algebraic Multigrid Solver and Preconditioner. Appl. Num. Math. 41 (2002), 155–177. https://codesign.llnl.gov/amg2013.php
[11] Michael A. Heroux, Douglas W. Doerfler, Paul S. Crozier, James M. Willenbring, H. Carter Edwards, Alan Williams, Mahesh Rajan, Eric R. Keiter, Heidi K. Thornquist, and Robert W. Numrich. 2009. Improving Performance via Mini-applications. Technical Report SAND2009-5574. Sandia National Laboratories.
[12] Torsten Hoefler, Timo Schneider, and Andrew Lumsdaine. 2010. Characterizing the Influence of System Noise on Large-Scale Applications by Simulation. In Proceedings of the 2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis (SC '10). IEEE Computer Society, Washington, DC, USA. https://doi.org/10.1109/SC.2010.12
[13] Joint Center for Advanced HPC (JCAHPC). 2017. Basic Specification of Oakforest-PACS. http://jcahpc.jp/files/OFP-basic.pdf. (March 2017).
[14] Ian Karlin, Jeff Keasler, and Rob Neely. 2013. LULESH 2.0 Updates and Changes. Technical Report LLNL-TR-641973. Lawrence Livermore National Laboratory. 1–9 pages.
[15] Florian Larysch. 2016. Fine-Grained Estimation of Memory Bandwidth Utilization. Master's Thesis. Operating Systems Group, Karlsruhe Institute of Technology (KIT), Germany.
[16] Chi-Keung Luk, Robert Cohn, Robert Muth, Harish Patil, Artur Klauser, Geoff Lowney, Steven Wallace, Vijay Janapa Reddi, and Kim Hazelwood. 2005. Pin: Building Customized Program Analysis Tools with Dynamic Instrumentation. In Proceedings of the 2005 ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI '05). ACM, New York, NY, USA, 190–200.
[17] Kengo Nakajima. 2003. Parallel Iterative Solvers of GeoFEM with Selective Blocking Preconditioning for Nonlinear Contact Problems on the Earth Simulator. In Proceedings of the 2003 ACM/IEEE Conference on Supercomputing (SC). ACM, New York, NY, USA. https://doi.org/10.1145/1048935.1050164
[18] Matthew Benjamin Olson, Tong Zhou, Michael R. Jantz, Kshitij A. Doshi, M. Graham Lopez, and Oscar Hernandez. 2018. MemBrain: Automated Application Guidance for Hybrid Memory Systems. In IEEE International Conference on Networking, Architecture, and Storage (NAS '18). (to appear).
[19] Ivy Bo Peng, Roberto Gioiosa, Gokcen Kestor, Pietro Cicotti, Erwin Laure, and Stefano Markidis. 2017. RTHMS: A Tool for Data Placement on Hybrid Memory System. In Proceedings of the 2017 ACM SIGPLAN International Symposium on Memory Management (ISMM 2017). ACM, New York, NY, USA, 82–91.
[20] Steve Plimpton. 1995. Fast Parallel Algorithms for Short-range Molecular Dynamics. (March 1995), 19 pages. https://doi.org/10.1006/jcph.1995.1039
[21] Taku Shimosawa, Balazs Gerofi, Masamichi Takagi, Gou Nakamura, Tomoki Shirasawa, Yuji Saeki, Masaaki Shimizu, Atsushi Hori, and Yutaka Ishikawa. 2014. Interface for Heterogeneous Kernels: A Framework to Enable Hybrid OS Designs targeting High Performance Computing on Manycore Architectures.
[22] Kai Wu, Yingchao Huang, and Dong Li. 2017. Unimem: Runtime Data Management on Non-volatile Memory-based Heterogeneous Main Memory. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC '17). ACM, New York, NY, USA, Article 58, 14 pages.