NumaPerf: Predictive and Full NUMA Profiling
Xin Zhao, Jin Zhou, and Hui Guan, University of Massachusetts Amherst; Wei Wang, University of Texas at San Antonio; Xu Liu, North Carolina State University; Tongping Liu, University of Massachusetts Amherst
Conference'17, July 2017, Washington, DC, USA
Abstract
It is extremely challenging for parallel applications to achieve optimal performance on the NUMA architecture, which necessitates the assistance of profiling tools. However, existing NUMA-profiling tools share similar shortcomings in portability, effectiveness, and helpfulness. This paper proposes a novel profiling tool, NumaPerf, that overcomes these issues. NumaPerf aims to identify potential performance issues for any NUMA architecture, instead of only the current hardware. To achieve this, NumaPerf focuses on memory sharing patterns between threads, instead of real remote accesses. NumaPerf further detects potential thread migrations and load imbalance issues that could significantly affect performance but are omitted by existing profilers. NumaPerf also separates cache coherence issues that may require different fix strategies. Based on our extensive evaluation, NumaPerf identifies more performance issues than any existing tool, while fixing these bugs leads to up to 5.94× performance speedup.

1 Introduction

The Non-Uniform Memory Access (NUMA) architecture is the de facto design to address the scalability issue with an increasing number of hardware cores. Compared to the Uniform Memory Access (UMA) architecture, the NUMA architecture avoids the bottleneck of a single memory controller by allowing each node/processor to access its own memory controller concurrently. However, the NUMA architecture imposes multiple system challenges for writing efficient parallel applications, such as remote accesses, interconnect congestion, and node imbalance [5]. User programs can easily suffer significant performance degradation, necessitating profiling tools that identify NUMA-related performance issues.

General-purpose profilers, such as gprof [12], perf [11], or
Coz [9], are not suitable for identifying NUMA-related performance issues [24, 30] because they are agnostic to the architectural difference. To detect NUMA-related issues, one type of tool simulates cache activities and page affinity based on collected memory traces [29, 33]. However, such tools may introduce significant performance slowdown, preventing their use even in development phases. A second type of profiler employs coarse-grained sampling to identify performance issues in the deployment environment [14, 18, 24, 26, 32, 35], while a third type builds on fine-grained instrumentation that can detect more performance issues but with higher overhead [10, 30]. However, the latter two types of tools share the following common issues. First, they mainly focus on one type of performance issue (i.e., remote accesses), while omitting other types of issues that may have a larger performance impact. Second, they have limited portability: they can only identify remote accesses on the current NUMA hardware. The major reason is that they rely on physical node information to detect remote accesses, where the physical page a thread accesses is located on a node different from the node of the current thread. However, the relationship between threads/pages and physical nodes can vary when an application runs on different hardware with a different topology, or even on the same hardware at another time. That is, existing tools may miss some remote accesses caused by a specific binding.
Third, existing tools cannot provide sufficient guidelines for bug fixes. Users have to spend significant effort to figure out the corresponding fix strategy by themselves.

This paper proposes a novel tool, NumaPerf, that overcomes these issues. NumaPerf is designed as an automatic tool that requires neither human annotation nor code changes. It also requires neither new hardware nor changes to the underlying operating system.
NumaPerf aims to detect NUMA-related issues in development phases, when applications are exercised with representative inputs. In this way, there is no need to pay additional and unnecessary runtime overhead in deployment phases. We further describe NumaPerf's distinctive goals and designs as follows.

First, NumaPerf aims to detect additional types of NUMA performance issues, while existing NUMA profilers can only detect remote accesses. The first type is load imbalance among threads, which may lead to memory controller congestion and interconnect congestion. The second type is cross-node migration, which turns all previously local accesses into remote accesses. Based on our evaluation, cross-node migration may lead to 4.18× performance degradation for fluidanimate. However, some applications may not have such issues, which requires the assistance of profiling tools.

Second, NumaPerf proposes a set of architecture-independent and scheduling-independent mechanisms that can predictively detect the above-mentioned issues on any NUMA architecture, even without running on a NUMA machine. NumaPerf's detection of remote accesses is based on a key observation: the memory sharing pattern of threads is an invariant determined by the program logic, but the relationship between threads/pages and physical nodes is architecture- and scheduling-dependent. Therefore, NumaPerf focuses on identifying memory sharing patterns between threads, instead of the specific node relationship of threads and pages, since a thread/page can be scheduled/allocated to/from a different node in a different execution. This mechanism not only simplifies the detection problem (without the need to track node information), but also generalizes to different architectures and executions (scheduling).
NumaPerf also proposes an architecture-independent mechanism to measure load imbalance based on the total number of memory accesses from threads: when different types of threads have different numbers of total memory accesses, the application has a load imbalance issue. NumaPerf further proposes a method to predict the probability of thread migrations: it computes a migration score based on the number of contended synchronizations and the number of condition and barrier waits. Overall, NumaPerf predicts a set of NUMA performance issues without requiring testing on a NUMA machine; its basic ideas are further discussed in Section 2.2.

Last but not least,
NumaPerf aims to provide more helpful information to assist bug fixes. First, it proposes a set of metrics to measure the seriousness of different performance issues, preventing programmers from spending unnecessary effort on insignificant issues. Second, its report can guide users toward a better fix. For load imbalance issues, NumaPerf suggests a thread assignment that achieves much better performance than existing work [1]. For remote accesses, multiple fix strategies exist with different levels of improvement, and currently programmers have to figure out a good strategy by themselves. In contrast, NumaPerf supplies more information to assist fixes. It separates cache false sharing issues from true sharing and page sharing, so that users can apply padding to achieve better performance. It further reports whether the data can be duplicated or not by confirming the temporal relationship of memory reads/writes. It also reports the threads accessing each page, which helps confirm whether a block-wise interleave with thread binding will yield a better performance improvement.

We performed extensive experiments to verify the effectiveness of
NumaPerf with widely-used parallel applications (i.e., PARSEC [4]) and HPC applications (e.g., AMG2006 [17], Lulesh [15], and UMT2013 [16]). Based on our evaluation, NumaPerf detects many more performance issues than the combination of all existing NUMA profilers, including both fine-grained and coarse-grained tools. After fixing such issues, these applications achieve up to 5.94× performance improvement. NumaPerf's helpfulness for bug fixes is also exemplified by multiple case studies. Overall, NumaPerf imposes a performance overhead that is orders of magnitude lower than the previous state-of-the-art in fine-grained analysis. The experiments also confirm that NumaPerf's detection is architecture-independent: it identifies most performance issues even when running on a non-NUMA machine.

Overall,
NumaPerf makes the following contributions.

• NumaPerf proposes a set of architecture-independent and scheduling-independent methods that can predictively detect NUMA-related performance issues, even without evaluating on a specific NUMA architecture.
• NumaPerf is able to detect a comprehensive set of NUMA-related performance issues, some of which are omitted by existing tools.
• NumaPerf designs a set of metrics to measure the seriousness of performance issues, and provides helpful information to assist bug fixes.
• We have performed extensive evaluations to confirm NumaPerf's effectiveness and overhead.
Outline
The remainder of this paper is organized as follows. Section 2 introduces the background of the NUMA architecture and the basic ideas of NumaPerf. Then Section 3 presents the detailed implementation and Section 4 shows experimental results. After that, Section 5 explains the limitations and Section 6 discusses related work in this field. In the end, Section 7 concludes this paper.
2 Background and Basic Ideas

This section starts with an introduction to the NUMA architecture and its potential performance issues. Then it briefly discusses the basic ideas of NumaPerf for identifying such issues.
2.1 NUMA Architecture

Traditional computers use the Uniform Memory Access (UMA) model. In this model, all CPU cores share a single memory controller such that any core can access memory with the same (uniform) latency. However, the UMA architecture cannot accommodate an increasing number of cores, because these cores may compete for the same memory controller. The memory controller becomes the performance bottleneck in many-core machines, since a task cannot proceed without getting its necessary data from memory.

The Non-Uniform Memory Access (NUMA) architecture is proposed to solve this scalability issue, as shown in Figure 1. It has a decentralized nature: instead of making all cores wait for the same memory controller, the NUMA architecture is typically equipped with multiple memory controllers, where each controller serves a group of CPU cores (called a "node" or "processor" interchangeably). Incorporating multiple memory controllers largely reduces the contention on memory controllers and therefore improves scalability correspondingly.

[Figure 1. A NUMA architecture with four nodes/domains, each with its own DRAM and group of cores.]

However, the NUMA architecture also introduces multiple sources of performance degradation [5], including
Cache Contention, Node Imbalance, Interconnect Congestion, and Remote Accesses.

Cache Contention: The NUMA architecture is prone to cache contention, including false and true sharing. False sharing occurs when multiple tasks access distinct words in the same cache line [3], while in true sharing different tasks access the same words. In both cases, multiple tasks compete for the shared cache line. Cache contention causes even more serious performance degradation if data has to be loaded from a remote node.
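To make the distinction concrete, the following sketch (not NumaPerf's actual code; the 8-byte word granularity is an illustrative assumption) classifies the accesses falling on one 64-byte cache line as false or true sharing, depending on whether distinct tasks touch distinct words or a common word:

```python
WORD = 8  # bytes per word (illustrative assumption)

def classify_line(accesses):
    """accesses: list of (thread_id, byte_offset_in_line) pairs
    observed on a single cache line."""
    threads_per_word = {}
    for tid, off in accesses:
        threads_per_word.setdefault(off // WORD, set()).add(tid)
    threads = {tid for tid, _ in accesses}
    if len(threads) < 2:
        return "no sharing"        # a single task cannot contend
    if any(len(ts) >= 2 for ts in threads_per_word.values()):
        return "true sharing"      # some word is shared across tasks
    return "false sharing"         # distinct words, same cache line
```

For example, two threads touching offsets 0 and 8 of the same line are flagged as false sharing, the case that padding would fix.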
Node Imbalance: When some memory controllers serve many more memory accesses than others, the node imbalance issue arises. Some tasks then wait longer for memory accesses, thwarting the overall progress of a multithreaded application.
Interconnect Congestion: Interconnect congestion occurs when tasks placed on remote nodes use the inter-node interconnect to access their memory.
Remote Accesses: In a NUMA architecture, local nodes can be accessed with lower latency than remote nodes. Therefore, it is important to reduce remote accesses to improve performance.
2.2 Basic Ideas of NumaPerf

Existing NUMA profilers mainly focus on detecting remote accesses, while omitting other performance issues. In contrast, NumaPerf has the following design goals. First, it aims to identify different sources of NUMA performance issues, not just remote accesses. Second, NumaPerf aims to design architecture- and scheduling-independent approaches that can report performance issues on any NUMA hardware. Third, it aims to provide sufficient information to guide bug fixes.

For the first goal, NumaPerf detects NUMA issues caused by cache contention, node imbalance, interconnect congestion, and remote accesses, where existing work only considers remote accesses. Cache contention can be caused by either false or true sharing, which imposes a larger performance impact and requires a different fix strategy; existing work never separates these from normal remote accesses. In contrast, NumaPerf designs a separate mechanism to detect such issues by tracking possible cache invalidations caused by cache contention. It is infeasible to measure all node imbalance and interconnect congestion without knowing the actual memory and thread bindings; instead, NumaPerf focuses on one specific type of issue, workload imbalance between different types of threads. Existing work also omits one type of remote access caused by thread migration, where a migration turns all of a thread's local accesses into remote ones. NumaPerf identifies whether an application has a high chance of thread migrations, in addition to normal remote accesses. Overall, NumaPerf detects more NUMA performance issues than existing NUMA profilers. However, the challenge is to design architecture- and scheduling-independent methods.

The second goal of
NumaPerf is to design architecture- and scheduling-independent approaches that do not bind to specific hardware. Detecting remote accesses is based on the key observation of Section 1: if a thread accesses a physical page that was initially accessed by a different thread, this access is counted as a remote access. This method is not bound to specific hardware, since memory sharing patterns between threads are typically invariant across executions. NumaPerf tracks every memory access in order to identify the first thread working on each page; for this reason, NumaPerf employs fine-grained instrumentation, since coarse-grained sampling may miss the access from the first thread. Based on memory accesses, NumaPerf also tracks the number of cache invalidations caused by false or true sharing with the following rule: a write on a cache line with multiple copies invalidates the other copies. Since the number of cache invalidations is closely related to the number of concurrent threads, NumaPerf divides the score by the number of threads to achieve similar results under different numbers of concurrent threads, as further described in Section 3.2.3. Load imbalance is evaluated by the total number of memory accesses of different types of threads; for this purpose it is important to track all memory accesses, including those from libraries. To evaluate the possibility of thread migration, NumaPerf proposes to track the number of lock contentions and the number of condition and barrier waits. Similar to false sharing, NumaPerf eliminates the effect of concurrent threads by dividing by the number of threads. The details of these implementations can be seen in Section 3.

For the third goal,
NumaPerf utilizes data-centric analysis as in existing work [24]. That is, it reports the callsites of heap objects that may have NUMA performance issues. In addition, NumaPerf aims to provide useful information that helps bug fixes, which can be easily achieved when all memory accesses are tracked. NumaPerf provides word-based access information for cache contention, helping programmers differentiate false from true sharing. It provides thread information on page sharing (helping determine whether to use block-wise interleave), and reports whether an object can be duplicated or not by tracking the temporal read/write pattern. NumaPerf also predicts a good thread assignment to achieve better performance for load imbalance issues. Many of these features require fine-grained instrumentation in order to avoid false alarms.

Due to the reasons mentioned above, NumaPerf utilizes fine-grained memory accesses to improve the effectiveness and provide better information for bug fixes.
NumaPerf employs compiler-based instrumentation to collect memory accesses, due to performance and flexibility concerns. An alternative approach is binary-based dynamic instrumentation [7, 25, 27], which may introduce more performance overhead but requires no additional compilation step. NumaPerf inserts an explicit function call for each read/write access on global variables and heap objects, while accesses on stack variables are omitted since they typically do not introduce performance issues. To track thread migration, NumaPerf also intercepts synchronizations. To support data-centric analysis, NumaPerf further intercepts memory allocations to collect their callsites.
[Figure 2. Overview of NumaPerf. NumaPerf-Static instruments memory accesses at compile and link time; NumaPerf-Dynamic tracks accesses, synchronizations, and allocations, and its detection engine produces the report.]
Figure 2 summarizes NumaPerf's basic idea. NumaPerf includes two components: NumaPerf-Static and NumaPerf-Dynamic. NumaPerf-Static is a static, compile-time tool that inserts a function call before every memory access on heap and global variables, compiling a program into an instrumented executable file. This executable file is then linked to NumaPerf-Dynamic so that NumaPerf can collect memory accesses, synchronizations, and information about memory allocations. NumaPerf then performs detection of NUMA-related performance issues, and reports them to users in the end. More specific implementation details are discussed in Section 3.
3 Design and Implementation

This section elaborates on NumaPerf-Static and NumaPerf-Dynamic. NumaPerf leverages compiler-based instrumentation (NumaPerf-Static) to insert a function call before each memory access, which allows NumaPerf-Dynamic to collect memory accesses. NumaPerf utilizes a preloading mechanism to intercept synchronizations and memory allocations, without the need to change programs explicitly. The detailed design and implementation are discussed as follows.
3.1 NumaPerf-Static

NumaPerf's static component (NumaPerf-Static) performs the instrumentation of memory accesses. In particular, it utilizes static analysis to identify memory accesses on heap and global variables, while omitting memory accesses on stack variables. Based on our understanding, stack variables will rarely cause performance issues if a thread is not migrated. NumaPerf-Static inserts a function call upon these memory accesses, where this function is implemented in the NumaPerf-Dynamic library. In particular, this function provides detailed information on the access, including the address, the type (i.e., read or write), and the number of bytes.

NumaPerf employs the LLVM compiler to perform the instrumentation [20]. It chooses the intermediate representation (IR) level for instrumentation due to its flexibility, since LLVM provides many APIs and tools to manipulate the IR. The instrumentation pass is placed at the end of the LLVM optimization passes, so that only memory accesses surviving all previous optimization passes are instrumented. NumaPerf-Static traverses functions one by one, and instruments memory accesses on global and heap variables. The instrumentation is adapted from AddressSanitizer [31].
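As a rough illustration of what the inserted calls do (the names report_access, instrumented_load, and instrumented_store are hypothetical stand-ins, not NumaPerf's real API), each instrumented load or store first invokes a runtime callback carrying the address, size, and access type:

```python
# Simulated instrumentation: the runtime callback records every access.
access_log = []

def report_access(addr, nbytes, is_write):
    """Hypothetical stand-in for the call NumaPerf-Static inserts."""
    access_log.append((addr, nbytes, is_write))

def instrumented_store(mem, addr, value):
    report_access(addr, 8, True)    # inserted before the real store
    mem[addr] = value

def instrumented_load(mem, addr):
    report_access(addr, 8, False)   # inserted before the real load
    return mem[addr]
```

Running one store followed by one load of the same address leaves two entries in the log, a write and then a read.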
3.2 NumaPerf-Dynamic

This subsection starts with tracking application information, such as memory accesses, synchronizations, and memory allocations. Then it discusses the detection of each particular performance issue. In the following, NumaPerf refers to NumaPerf-Dynamic unless noted otherwise.
3.2.1 Tracking Application Information. NumaPerf-Dynamic implements the inserted functions invoked before memory accesses, allowing it to track memory accesses. Once a memory access is intercepted, NumaPerf performs the detection discussed below.

NumaPerf utilizes a preloading mechanism to intercept synchronizations and memory allocations before invoking the corresponding functions. NumaPerf intercepts synchronizations in order to detect possible thread migrations, which will be explained later. NumaPerf also intercepts memory allocations, so that performance issues can be attributed to different callsites, assisting data-centric analysis [24]. For each memory allocation, NumaPerf records the allocation callsite and its address range. NumaPerf also intercepts thread creations in order to set up per-thread data structures; in particular, it assigns each thread a thread index.
3.2.2 Detecting Remote Accesses. NumaPerf detects a remote access when an access's thread differs from the corresponding page's initial accessor, as discussed in Section 2. This is based on the assumption that the OS typically allocates a physical page from the node of the first accessor, due to the default first-touch policy [19]. Similar to existing work, NumaPerf may over-estimate the number of remote accesses, since an access is not a remote one if the corresponding cache line has not been evicted. However, this shortcoming can easily be overcome by only reporting issues above a specified threshold, as exemplified in our evaluation (Section 4).
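A minimal sketch of this first-touch bookkeeping, assuming 4 KiB pages (the names are illustrative, not NumaPerf's internals): the first thread to touch a page becomes its owner, and every later access from a different thread counts as a potential remote access.

```python
PAGE_SHIFT = 12  # 4 KiB pages (assumption)

first_toucher = {}   # page index -> thread id of the first access
remote_counts = {}   # page index -> accesses by non-first-touch threads

def record_access(tid, addr):
    page = addr >> PAGE_SHIFT
    owner = first_toucher.setdefault(page, tid)  # first touch claims the page
    if owner != tid:
        remote_counts[page] = remote_counts.get(page, 0) + 1
```

Because the rule depends only on which thread touched a page first, not on which physical node the page landed on, the same trace yields the same report on any machine.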
NumaPerf is carefully designed to reduce its performance and memory overhead. NumaPerf tracks a page's initial accessor to determine remote accesses. A naive design would employ a hash table to track such information. Instead, NumaPerf maps a continuous range of memory with the shadow memory technique [34], which only requires a simple computation to locate the data. NumaPerf also maintains the number of accesses for each page in the same map. We observed that a page without a large number of memory accesses will not cause significant performance issues. Based on this, NumaPerf only tracks detailed accesses for a page when its number of accesses exceeds a pre-defined (configurable) threshold. Since the recording uses the same data structures, NumaPerf uses an internal pool to maintain such data structures with the exact size, without resorting to the default allocator.

For pages with excessive accesses, NumaPerf tracks the following information. First, it tracks the threads accessing these pages, which helps determine whether to use block-wise allocations for fixes. Second, NumaPerf further divides each page into multiple blocks (e.g., 64 blocks), and tracks the number of accesses on each block. This enables it to compute the number of remote accesses of each object more accurately. Third, NumaPerf further checks whether an object is exclusively read after the first write, which determines whether duplication is possible. Last but not least, NumaPerf maintains word-level information for cache lines with excessive cache invalidations, as further described in Section 3.2.3.
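The page-level counting with threshold-triggered detailed tracking can be sketched as follows. The flat per-page array stands in for the shadow-memory mapping (one arithmetic step instead of a hash lookup), and the threshold and 64-block page geometry are illustrative assumptions:

```python
THRESHOLD = 4  # illustrative; NumaPerf's threshold is configurable

class PageShadow:
    """Flat per-page counters; detailed tracking only past a threshold."""
    def __init__(self, npages):
        self.counts = [0] * npages   # one slot per page, O(1) lookup
        self.detailed = {}           # page -> per-block access counts

    def on_access(self, addr, page_shift=12, block_shift=6):
        page = addr >> page_shift
        self.counts[page] += 1
        if self.counts[page] > THRESHOLD:
            # page is hot: track per-block counts (64 blocks per 4 KiB page)
            blocks = self.detailed.setdefault(page, {})
            block = (addr >> block_shift) & 0x3F
            blocks[block] = blocks.get(block, 0) + 1
```

Cold pages cost one array increment per access; only hot pages pay for the dictionary-based detailed records.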
Remote (Access) Score: NumaPerf proposes a performance metric, the remote score, to evaluate the seriousness of remote accesses. An object's remote score is defined as the number of remote accesses within a specific interval, currently set to one millisecond. A higher score typically indicates more serious remote accesses, as shown in Table 1. For pages with both remote accesses and cache invalidations, we check whether cache invalidation is dominant: if the number of cache invalidations is larger than 50% of the remote accesses, the major performance issue of this page is caused by cache invalidations, and we omit the remote accesses instead.
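Under these definitions, the remote score and the dominance check can be sketched as follows (the function names are ours, not NumaPerf's):

```python
def remote_score(remote_accesses, elapsed_ms):
    """Remote accesses per one-millisecond interval."""
    return remote_accesses / elapsed_ms

def dominant_issue(remote_accesses, cache_invalidations):
    """If invalidations exceed 50% of remote accesses, report the
    coherence problem and omit the remote-access one."""
    if cache_invalidations > 0.5 * remote_accesses:
        return "cache invalidation"
    return "remote access"
```

For instance, a page with 3000 remote accesses over 2 ms scores 1500, at the reporting boundary used in the evaluation.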
3.2.3 Detecting False and True Sharing. Based on our observation, cache coherence issues have a higher performance impact than normal remote accesses. Further, false sharing has a different fix strategy, typically padding. NumaPerf therefore detects false and true sharing separately, which is different from all NUMA profilers.

NumaPerf detects false/true sharing with a mechanism similar to Predator [23], but adapted for the NUMA architecture. Predator tracks cache invalidations as follows: if a thread writes a cache line that is loaded by multiple threads, the write introduces one cache invalidation. However, this mechanism under-estimates the number of cache invalidations. Instead, NumaPerf tracks the number of threads that have loaded the same cache line, and increases the invalidation count by the number of threads that have loaded this cache line.
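One way to implement this counting rule is sketched below. This is our interpretation of the rule, under the assumption that a write invalidates the copy held by every other thread that loaded the line since the last write:

```python
class LineState:
    """Per-cache-line loader set; a write invalidates all other copies."""
    def __init__(self):
        self.loaders = set()
        self.invalidations = 0

    def load(self, tid):
        self.loaders.add(tid)        # thread tid now holds a copy

    def store(self, tid):
        # Count one invalidation per other thread holding a copy,
        # rather than Predator's single invalidation per write.
        self.invalidations += len(self.loaders - {tid})
        self.loaders = {tid}         # writer now holds the only valid copy
```

With three loaders, a single write is charged two invalidations, which Predator's one-per-write rule would under-count.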
False/True Sharing Score: NumaPerf further proposes false/true sharing scores for each corresponding object, which Predator [23] lacks. The scores are computed by dividing the number of cache invalidations by the product of time (in milliseconds) and the number of threads. The number of threads is included to reduce the impact of the parallelization degree, in an architecture-independent way. NumaPerf differentiates false sharing from true sharing by recording word-level accesses. Note that, due to performance concerns, NumaPerf only records word-level accesses for cache lines whose number of writes exceeds a pre-defined threshold.
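The score computation itself is direct; a sketch with illustrative inputs:

```python
def sharing_score(invalidations, elapsed_ms, nthreads):
    """Invalidations normalized by runtime (ms) and thread count, so
    runs with different degrees of parallelism yield comparable scores."""
    return invalidations / (elapsed_ms * nthreads)
```

An object causing 9600 invalidations over 100 ms with 8 threads scores 12; doubling the thread count (and, roughly, the invalidations) leaves the score unchanged, which is the point of the normalization.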
3.2.4 Detecting Thread Migrations. As discussed in Section 1, NumaPerf identifies applications with excessive thread migrations, which are omitted by all existing NUMA profilers. Thread migration may introduce excessive remote accesses: after a migration, a thread is forced to reload all data from the original node, and accesses its stack remotely afterwards. Further, all deallocations from this thread may be returned to the freelists of remote nodes, causing more remote accesses afterwards.

Thread Migration Score: NumaPerf evaluates the seriousness of thread migrations with a thread migration score, computed with the following formula:

S = p · Σ_{t∈T} m_t / (r · |T|)

where S is the thread migration score, p is the parallel-phase percentage of the program, T is the set of threads in the program, |T| is the number of threads, m_t is the number of possible migrations of thread t, and r is the total running time of the program in seconds. NumaPerf uses the total number of lock contentions, condition waits, and barrier waits as the possible migration count. The parallel-phase percentage indicates the necessity of performing the optimization: for instance, if the parallel-phase percentage is only 1%, then we can improve performance by at most 1%. To reduce the effect of parallelization, the score is further divided by the number of threads. Based on our evaluation, this normalization makes two platforms with different numbers of threads produce very similar results.

When an application has a large number of thread migrations, NumaPerf suggests that users utilize thread binding to reduce remote accesses. As shown in Table 1, thread migration may degrade the performance of an application (i.e., fluidanimate) by up to 418%, showing the importance of eliminating thread migration for such applications. However, some applications in PARSEC (not shown in Table 1) see only marginal performance improvement from thread binding.
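The formula above translates directly into code; a sketch in which, as the text describes, each m_t is approximated by that thread's count of lock contentions plus condition and barrier waits:

```python
def migration_score(parallel_frac, migrations_per_thread, run_secs):
    """S = p * sum(m_t) / (r * |T|), with p the parallel-phase fraction,
    m_t the possible migrations of thread t (lock contentions plus
    condition and barrier waits), and r the total running seconds."""
    nthreads = len(migrations_per_thread)
    return parallel_frac * sum(migrations_per_thread) / (run_secs * nthreads)
```

Dividing by |T| is what makes the score comparable across machines with different thread counts.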
3.2.5 Detecting Load Imbalance. Load imbalance is another factor that can significantly affect performance on the NUMA architecture, as it can cause node imbalance and interconnect congestion. NumaPerf detects load imbalance among different types of threads, which is omitted by existing NUMA profilers.

The detection is based on an assumption: in a balanced environment, every type of thread should have a similar number of memory accesses. NumaPerf proposes to utilize the number of memory accesses to predict the workload of each type of thread. In particular, NumaPerf monitors memory accesses on heap objects and globals, and then uses the sum of such memory accesses to check for imbalance.

NumaPerf further predicts an optimal thread assignment from the number of memory accesses. A balanced assignment balances the memory accesses from each type of thread. For instance, if the numbers of memory accesses of two types of threads have a one-to-two ratio, NumaPerf will suggest assigning threads in a one-to-two ratio. Section 4.2 further evaluates NumaPerf's suggested assignment, where NumaPerf significantly outperforms another work [1].
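The proportional-assignment suggestion can be sketched as below; the largest-remainder rounding is our assumption for making the suggested counts sum to the available threads:

```python
def suggest_assignment(accesses_per_type, total_threads):
    """Split total_threads across thread types in proportion to each
    type's memory-access count (largest-remainder rounding)."""
    total = sum(accesses_per_type)
    raw = [total_threads * a / total for a in accesses_per_type]
    counts = [int(x) for x in raw]
    # hand leftover threads to the types with the largest remainders
    leftover = total_threads - sum(counts)
    order = sorted(range(len(raw)), key=lambda i: raw[i] - counts[i],
                   reverse=True)
    for i in order[:leftover]:
        counts[i] += 1
    return counts
```

For a one-to-two access ratio and 12 threads, this suggests 4 and 8 threads for the two types, matching the example above.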
4 Evaluation

This section aims to answer the following research questions:

• Effectiveness: Can NumaPerf detect more performance issues than existing NUMA profilers? (Section 4.1) How helpful is NumaPerf's detection report? (Section 4.2)
• Performance: How much performance overhead is imposed by NumaPerf's detection, compared to the state-of-the-art tool? (Section 4.3)
• Memory Overhead: What is the memory overhead of NumaPerf? (Section 4.4)
• Architecture Independence: Can NumaPerf detect similar issues when running on a non-NUMA architecture? (Section 4.5)
Experimental Platform: NumaPerf was evaluated on a machine with 8 nodes and 128 physical cores in total, except in Section 4.5. This machine is equipped with 512GB of memory. Any two nodes in this machine are at most 3 hops apart, where the relative latencies of two hops and three hops are 2.1 and 3.1 respectively, with local latency normalized to 1.0. The OS for this machine is Linux Debian 10 and the compiler is GCC-8.3.0. Hyperthreading was turned off for the evaluation.
We evaluated
NumaPerf on multiple HPC applications (e.g.,AMG2006 [17], lulesh [15], and UMT2013 [16]) and a widely-used multithreaded application benchmark suite — PAR-SEC [4]. Applications with NUMA performance issues arelisted in Table 1. The performance improvement after fixingall issues is listed in “Improve” column, with the average of10 runs, where all specific issues are listed afterwards. Foreach issue, the table listed the type of issue and the corre-sponding score, the allocation site, and the fix strategy. Notethat the table only shows cases with page sharing score largerthan 1500 (if without cache false/true sharing), false/true shar-ing score larger than 1, and thread migration score larger than150. Further, the performance improvement of each specificissue is listed as well. We also present multiple cases studiesthat show how
NumaPerf ’s report is able to assist bug fixesin Section 4.2.Overall, we have the following observations. First, it re-ports no false positives by only reporting scores larger thana threshold. Second,
NumaPerf detects more performanceissues than the combination of all existing NUMA profil-ers [10, 14, 18, 24, 26, 30, 32, 35]. The performance issuesthat cannot be detected by existing NUMA profilers are high-lighted with a check mark in the last column of the table, al-though some can be detected by specific tools, such as cachefalse/true sharing issues [8, 13, 21–23]. This comparison withexisting NUMA profilers is based on the methodology. Exist-ing NUMA profilers cannot separate false or true sharing withnormal remote accesses, and cannot detect thread migrationand load imbalance issues.When comparing to a specific profiler,
NumaPerf also has better results even on detecting remote accesses. For lulesh, NumaPerf detects three more issues than HPCToolkit, for several reasons. First, NumaPerf's predictive method detects some issues that do not occur under the current scheduling and the current hardware, while HPCToolkit has no such capability. Second, HPCToolkit requires binding threads to nodes, which may miss remote accesses caused by its specific binding. Third, NumaPerf's fine-grained profiling provides better effectiveness than a coarse-grained profiler like HPCToolkit.
NumaPerf may have false negatives caused by its instrumentation. For instance, NumaPerf cannot detect an issue of UMT2013 reported by HPCToolkit [24]; the basic reason is that NumaPerf cannot instrument Fortran code. NumaPerf's limitations are further discussed in Section 4.2.

NumaPerf: Predictive and Full NUMA Profiling. Conference'17, July 2017, Washington, DC, USA
Application (Improve)  No.  Issue             Score   Allocation site         Fix               Improve  New
lulesh (594%)           3   remote access      1840   lulesh.cc:543-545       block interleave  429%
                        4   remote access      1504   lulesh.cc:1029-1034     block interleave  504%
                        5   remote access      4496   lulesh.cc:2251-2264     block interleave  406% 418%
                        6   false sharing        26                           padding           103%     ✓ ✓ ✓
UMT2013 (131%)         10   thread migration     18                           thread binding    131%     ✓
bodytrack (109%)       11   remote access     10800   FlexImageStore.h:146    page interleave   106%
                       12   false sharing        24                                                      ✓
                       13   thread migration    297                           thread binding    105%     ✓
dedup (116%)           14   thread imbalance                                  adjust threads    116%     ✓
facesim (105%)         15   thread migration    607                           thread binding    105%     ✓
ferret (206%)          16   thread imbalance                                  adjust threads    206%     ✓
fluidanimate (429%)    17   remote access     90534   pthreads.cpp:292        page interleave   340%
                       18   true sharing       2941                                                      ✓
                       19   remote access       180   pthreads.cpp:294        page interleave   112% 160%
                       20   false sharing        20                           padding           158%     ✓
                       21   thread migration     73                           thread binding    418%     ✓
streamcluster (167%)   22   remote access       427   streamcluster.cpp:984   page interleave   100% 103%
                       23   false sharing        31                           padding           102%     ✓
                       24   remote access      7169   streamcluster.cpp:1845  duplicate         158%
                       25   thread migration    229                           thread binding    132%     ✓

Table 1. Detected NUMA performance issues when running on an 8-node NUMA machine. NumaPerf detects 15 more performance bugs that cannot be detected by existing NUMA profilers (with a check mark in the last column).
This section presents multiple case studies showing how programmers could fix performance issues based on NumaPerf's report.
For remote accesses, NumaPerf not only reports remote access scores, indicating the seriousness of the corresponding issue, but also provides additional information to assist bug fixes. Remote accesses can be fixed with different strategies, such as padding (for false sharing), block-wise interleaving, duplication, and page interleaving.
Allocation Site: lulesh.cc:2251
Remote score: 4496
False sharing score: 26
True Sharing score: 0.00
Pages accessed by threads: 0--8, 8--16, 16--23, 23--31 ......
Listing 1.
Remote access issue of lulesh
NumaPerf provides data-centric analysis, as existing work does [24]. That is, it always attributes performance issues to their allocation callsite. NumaPerf also shows the seriousness with its remote access score. NumaPerf further reports more specific information to guide the fix. As shown in Listing 1, NumaPerf reports which threads access each page. Based on this information, block-wise interleaving is the better strategy for the fix, which achieves a better performance result. However, for Issues 17 and 19 of fluidanimate, there is no such access pattern; therefore, those bugs can be fixed with the normal page interleave method.
Allocation site: streamcluster.cpp:1845
Remote score: 7169
False sharing score: 0.00
True Sharing score: 0.00
Continuous reads after the last write: 2443582804
Listing 2.
Remote access issue of streamcluster. Listing 2 shows another example of remote accesses. For this issue (Issue 24 in Table 1), the object is almost exclusively read after its last write, as indicated by the huge number of continuous reads; therefore, duplicating the object on each node eliminates the remote accesses.
For cache coherency issues, NumaPerf differentiates them from normal remote accesses, and further differentiates false sharing from true sharing. Given the report, programmers could utilize padding to eliminate false sharing issues. As shown in Table 1, many applications have false sharing issues (e.g., Issues 6, 12, 20, and 23).
When an application has frequent thread migrations, it may introduce excessive remote accesses. For such issues, the fix strategy is to bind threads to nodes. Typically, there are two strategies: round robin and packed binding. Round robin binds consecutive threads to different nodes one by one, ensuring that different nodes have a similar number of threads. Packed binding binds multiple threads to the first node, typically the same number as the hardware cores in one node, and then moves to the next node. Based on our observation, round robin typically achieves better performance than packed binding, and it is the default binding policy for our evaluations in Table 1. Thread binding by itself achieves a performance improvement of up to 418% (e.g., fluidanimate), which indicates the importance of thread binding for some applications.
For load imbalance issues, NumaPerf not only reports the existence of such issues, but also suggests a thread assignment based on the number of sampled memory accesses. Programmers could fix the issues based on this suggestion. For dedup, NumaPerf reports that memory accesses of the anchor, chunk, and compress threads have a proportion of 92.2:0.33:3.43 when all libraries are instrumented. That is, the ratio of the chunk threads to the compress threads is around 1 to 10. By checking the code, we understand that dedup has multiple stages, where the anchor stage precedes the chunk stage, and the chunk stage precedes the compress stage. Threads of a previous stage store results into multiple queues, which are consumed by threads of the next stage. Since many threads competing for the same queue may introduce high contention, the fix simply sets the number of chunk threads to 2. Based on this, we further set the number of compress threads to 18, and the number of anchor threads to 76; the corresponding queues are 18:2:2:4. With this setting, dedup's performance is improved by 116%. We further compare this with the assignment suggested by existing work, SyncPerf [1], which assumes that different types of threads should have the same waiting time. SyncPerf proposes that the best assignment should be 24:24:48, which only improves the performance by 105%. In another example, ferret, NumaPerf suggests a proportion for its four types of threads, and we configure the threads accordingly. With this assignment, ferret's performance increases by 206% compared with the original version. In contrast, SyncPerf suggests a different assignment; however, following that assignment actually degrades the performance by 354% instead.

Performance overhead of
NumaPerf and others. We also evaluated the performance of
NumaPerf on PARSEC applications, and the performance results are shown in Figure 3. On average,
NumaPerf's overhead is around 585%, which is orders of magnitude smaller than the state-of-the-art fine-grained profiler, NUMAPROF [30]. In contrast, NUMAPROF runs × slower than the original execution. NumaPerf is carefully designed to avoid such high overhead, as discussed in Section 3. Also, NumaPerf's compiler instrumentation helps reduce some overhead by excluding memory accesses on stack variables. There are some exceptions: two applications impose more than × overhead, namely swaptions and x264. Based on our investigation, even instrumentation with an empty function imposes more than × overhead for them, because they perform significantly more memory accesses than other applications like blackscholes; for instance, swaptions performs more than × as many memory accesses as blackscholes in a time unit. Applications with low overhead can be explained by not instrumenting libraries, which are typically not the source of NUMA performance issues. We further evaluated
NumaPerf's memory overhead with PARSEC applications. The results are shown in Table 2. In total, NumaPerf's memory overhead is around 28%, which is much smaller than that of the state-of-the-art fine-grained profiler, NUMAPROF [30]. NumaPerf's memory overhead mainly comes from the following sources. First, NumaPerf records detailed information at the page level and cache-line level, so that it can provide detailed information to assist bug fixes. Second, NumaPerf also stores allocation callsites for every object in order to attribute performance issues back to the data.

Apps           Glibc   NumaPerf   NUMAPROF
blackscholes     617      689        685
bodytrack         36      139        260
canneal          887     1476       2383
dedup            917     1806       2388
facesim         2638     2826       3005
ferret           160      301        445
fluidanimate     470      667        753
raytrace        1287     1610       2089
streamcluster    112      216        928
swaptions         28       67        255
vips             226      283        463
x264            2861     3039       3108
Total          10239    13119      16762

Table 2. Memory consumption (MB) of different profilers.

We notice that some applications have a larger percentage of memory overhead, such as streamcluster. For streamcluster, a large object has very serious NUMA issues; therefore, recording detailed page-level and cache-level information contributes most of the memory overhead. However, overall,
NumaPerf's memory overhead is acceptable, since it provides much more helpful information to assist bug fixes.
We further confirm whether NumaPerf is able to detect similar performance issues when running on a non-NUMA (UMA) machine. We performed the experiments on a two-processor machine, where each processor is an Intel(R) Xeon(R) Gold 6230 with 20 cores. We explicitly disabled all cores in node 1 and utilized only 16 hardware cores in node 0. This machine has 256GB of main memory, 64KB of L1 cache, and 1MB of L2 cache. The experimental results are listed in Table 3. For simplicity, we only list the applications, the issue numbers, and the seriousness scores on the two machines. Table 3 shows that most reported scores on the two machines are very similar, with only small variance. The small variance could be caused by multiple factors, such as the degree of parallelization (concurrency). However, the table shows that all serious issues can be detected on both machines. This indicates that NumaPerf achieves its design goal: it can detect NUMA issues even without running on a NUMA machine.
NumaPerf relies on compiler-based instrumentation to capture memory accesses. Therefore, it shares the same shortcomings and strengths as all compiler-based instrumentation. On the one hand, NumaPerf can perform static analysis to reduce unnecessary memory accesses, such as accesses of stack variables.
NumaPerf typically achieves much better performance than binary-based instrumentation tools, such as Numaprof [30].

Table 3. Evaluation of architecture sensitivity. We evaluated NumaPerf on a non-NUMA (UMA) machine, which produces very similar results to those on a NUMA machine. For ferret, NumaPerf reports a similar proportion on the 8-node NUMA machine and on the UMA machine.

On the other hand,
NumaPerf requires re-compilation (and the availability of the source code), and will miss memory accesses without instrumentation. That is, it cannot detect NUMA issues caused by non-instrumented components (e.g., libraries), and thus suffers from false negatives. However, most issues occur in applications, not in libraries.
This section first discusses NUMA-profiling tools, and then other relevant tools and systems.
Simulation-Based Approaches:
Bolosky et al. propose to model NUMA performance issues based on collected traces, and then derive a better NUMA placement policy [6].
NUMAgrind employs binary instrumentation to collect memory traces, and simulates cache activities and page affinity [33]. MACPO reduces the overhead of collecting memory traces and analysis by focusing on code segments that have known performance bottlenecks [29]; that is, it typically requires programmer input to reduce its overhead. Simulation-based approaches can be utilized for any architecture, which is very useful. However, they are typically extremely slow, with slowdowns of thousands of times, which makes them unaffordable even for development phases. Further, they still require evaluating the performance impact for a given architecture, which significantly limits their usage. NumaPerf utilizes a measurement-based approach, which avoids the significant performance overhead of simulation-based approaches.
Fine-Grained Approaches:
TABARNAC focuses on the visualization of memory access behaviors of different data structures [2]. It uses Pin to collect the memory accesses of every thread at the page level, and then combines them with data structure information to visualize the usage of data structures. It introduces runtime overhead between × and ×, in addition to its offline overhead. Diener et al. propose to instrument memory accesses with Pin dynamically, and then characterize the distribution of accesses across NUMA nodes [10]; the paper does not present detailed overhead numbers. Numaprof also uses binary instrumentation (i.e., Pin) to collect and identify local and remote memory accesses [30]. Numaprof relies on a specific thread binding to detect remote accesses, which shares the same shortcoming as other existing work [24, 35]. Numaprof also shares another issue with other tools: it focuses only on remote accesses while omitting other issues such as cache coherence and imbalance issues. In addition, Numaprof is only a code-based profiler that reports program statements with excessive remote memory accesses, which requires programmers to figure out the data (object) and a specific fix strategy. This shortcoming makes the comparison with Numaprof extremely difficult and time-consuming. In contrast, although NumaPerf also utilizes fine-grained measurement, it detects more issues that may cause performance degradation on any NUMA architecture, and provides more useful information for bug fixes.
Coarse-Grained Approaches:
Many tools employ hardware Performance Monitoring Units (PMUs) to identify NUMA-related performance issues, such as VTune [14], Memphis [26], MemProf [18], Xu et al. [24], NumaMMA [32], and LaProf [35]; their differences are described in the following. Both VTune [14] and Memphis [26] only detect NUMA performance issues on statically-linked variables. MemProf employs PMUs to identify NUMA-related performance issues [18], with a focus on remote accesses; it constructs the data flow between threads and objects to help understand NUMA performance issues. One drawback of MemProf is that it requires an additional kernel module, which may prevent people from using it. Similarly, Xu et al. also employ PMUs to detect NUMA performance issues [24], but without changing the kernel. They further propose a new metric, the NUMA latency per instruction, to evaluate the seriousness of NUMA issues. This tool has the drawback that it statically binds every thread to a node, which may miss some NUMA issues due to the static binding. NumaMMA also collects traces with PMU hardware, but focuses on the visualization of memory accesses [32]. LaProf focuses on multiple issues that may cause performance degradation on NUMA architectures [35], including data sharing, shared resource contention, and remote imbalance. LaProf has the same shortcoming of binding every thread statically. Overall, although these sampling-based approaches impose much lower overhead, making them applicable even in production environments, they cannot detect all NUMA performance issues, especially since most of them focus only on remote accesses. In contrast, NumaPerf aims to detect performance issues during development phases, avoiding any additional runtime overhead in production. Also, NumaPerf focuses on more aspects with a predictive approach, not just remote accesses on the current hardware. Our evaluation results confirm NumaPerf's comprehensiveness and effectiveness.
RTHMS also employs Pin to collect memory traces, and then assigns a score to each object-to-memory mapping based on its algorithms [28]. It aims at identifying performance issues for the hybrid DRAM-HBM architecture, rather than the NUMA architecture, and has a higher overhead than NumaPerf. Some tools focus on the detection of false/true sharing issues [8, 13, 21–23], but skip other NUMA issues. SyncPerf also detects load imbalance and predicts the optimal thread assignment [1]. SyncPerf aims to achieve the optimal thread assignment by balancing the waiting time of each type of thread. In contrast, NumaPerf suggests the optimal thread assignment based on the number of accesses of each thread, which reflects the actual workload.
Parallel applications running on NUMA machines are prone to different types of performance issues. Existing NUMA profilers may miss a significant portion of optimization opportunities, and they are bound to a specific NUMA topology. Different from them, NumaPerf proposes an architecture-independent and scheduling-independent method that can detect NUMA issues even without running on a NUMA machine. Compared to existing NUMA profilers, NumaPerf detects more performance issues without false alarms, and also provides more helpful information to assist bug fixes. In summary, NumaPerf will be an indispensable tool to identify NUMA issues during development phases.
References

[1] Mohammad Mejbah ul Alam, Tongping Liu, Guangming Zeng, and Abdullah Muzahid. SyncPerf: Categorizing, detecting, and diagnosing synchronization performance bugs. In Proceedings of the Twelfth European Conference on Computer Systems, EuroSys '17, pages 298–313, New York, NY, USA, 2017. ACM.
[2] David Beniamine, Matthias Diener, Guillaume Huard, and Philippe O. A. Navaux. TABARNAC: Visualizing and resolving memory access issues on NUMA architectures. In Proceedings of the 2nd Workshop on Visual Performance Analysis, VPA '15, New York, NY, USA, 2015. ACM.
[3] Emery D. Berger, Kathryn S. McKinley, Robert D. Blumofe, and Paul R. Wilson. Hoard: A scalable memory allocator for multithreaded applications. In ASPLOS-IX: Proceedings of the Ninth International Conference on Architectural Support for Programming Languages and Operating Systems, pages 117–128, New York, NY, USA, 2000. ACM Press.
[4] Christian Bienia and Kai Li. PARSEC 2.0: A new benchmark suite for chip-multiprocessors. In Proceedings of the 5th Annual Workshop on Modeling, Benchmarking and Simulation, June 2009.
[5] Sergey Blagodurov, Sergey Zhuravlev, Mohammad Dashti, and Alexandra Fedorova. A case for NUMA-aware contention management on multicore systems. In Proceedings of the 2011 USENIX Annual Technical Conference, USENIX ATC'11, pages 1–1, Berkeley, CA, USA, 2011. USENIX Association.
[6] William J. Bolosky, Michael L. Scott, Robert P. Fitzgerald, Robert J. Fowler, and Alan L. Cox. NUMA policies and their relation to memory architecture. In Proceedings of the Fourth International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS IV, pages 212–221, New York, NY, USA, 1991. ACM.
[7] Derek Bruening, Timothy Garnett, and Saman Amarasinghe. An infrastructure for adaptive dynamic optimization. In Proceedings of the International Symposium on Code Generation and Optimization: Feedback-Directed and Runtime Optimization, CGO '03, pages 265–275, USA, 2003. IEEE Computer Society.
[8] Milind Chabbi, Shasha Wen, and Xu Liu. Featherlight on-the-fly false-sharing detection. In Proceedings of the 23rd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPoPP 2018, Vienna, Austria, February 24–28, 2018, pages 152–167. ACM, 2018.
[9] Charlie Curtsinger and Emery D. Berger. Coz: Finding code that counts with causal profiling. In Proceedings of the 25th Symposium on Operating Systems Principles, SOSP '15, pages 184–197, New York, NY, USA, 2015. ACM.
[10] Matthias Diener, Eduardo H. M. Cruz, Laércio L. Pilla, Fabrice Dupros, and Philippe O. A. Navaux. Characterizing communication and page usage of parallel applications for thread and data mapping. Performance Evaluation, 88:18–36, 2015.
[11] Stephane Eranian, Eric Gouriou, Tipp Moseley, and Willem de Bruijn. Linux kernel profiling with perf. https://perf.wiki.kernel.org/index.php/Tutorial, 2015.
[12] Susan L. Graham, Peter B. Kessler, and Marshall K. McKusick. gprof: A call graph execution profiler. In SIGPLAN Symposium on Compiler Construction, pages 120–126, 1982.
[13] Christian Helm and Kenjiro Taura. PerfMemPlus: A tool for automatic discovery of memory performance problems. In International Conference on High Performance Computing, pages 209–226. Springer, 2019.
[14] Intel Corporation. Intel VTune performance analyzer.
[15] Lawrence Livermore National Laboratory. Livermore Unstructured Lagrangian Explicit Shock Hydrodynamics (LULESH). https://codesign.llnl.gov/lulesh.php, Dec 2010.
[16] Lawrence Livermore National Laboratory. LLNL CORAL benchmarks. https://asc.llnl.gov/CORAL-benchmarks, Dec 2013.
[17] Lawrence Livermore National Laboratory. LLNL Sequoia benchmarks. https://asc.llnl.gov/sequoia/benchmarks, Dec 2013.
[18] Renaud Lachaize, Baptiste Lepers, and Vivien Quéma. MemProf: A memory profiler for NUMA multicore systems. In Proceedings of the 2012 USENIX Annual Technical Conference, USENIX ATC'12, pages 5–5, Berkeley, CA, USA, 2012. USENIX Association.
[19] Christoph Lameter. An overview of non-uniform memory access. Commun. ACM, 56(9), September 2013.
[20] Chris Lattner and Vikram Adve. LLVM: A compilation framework for lifelong program analysis & transformation. In Proceedings of the International Symposium on Code Generation and Optimization: Feedback-Directed and Runtime Optimization, CGO '04, pages 75–, Washington, DC, USA, 2004. IEEE Computer Society.
[21] Tongping Liu and Emery D. Berger. Sheriff: Precise detection and automatic mitigation of false sharing. In Proceedings of the 2011 ACM International Conference on Object Oriented Programming Systems Languages and Applications, OOPSLA '11, pages 3–18, New York, NY, USA, 2011. ACM.
[22] Tongping Liu and Xu Liu. Cheetah: Detecting false sharing efficiently and effectively. In Proceedings of the 2016 International Symposium on Code Generation and Optimization, CGO 2016, pages 1–11, New York, NY, USA, 2016. ACM.
[23] Tongping Liu, Chen Tian, Hu Ziang, and Emery D. Berger. Predator: Predictive false sharing detection. In Proceedings of the 19th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPoPP '14, New York, NY, USA, 2014. ACM.
[24] Xu Liu and John Mellor-Crummey. A tool to analyze the performance of multithreaded programs on NUMA architectures. In Proceedings of the 19th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPoPP '14, pages 259–272, New York, NY, USA, 2014. ACM.
[25] Chi-Keung Luk, Robert Cohn, Robert Muth, Harish Patil, Artur Klauser, Geoff Lowney, Steven Wallace, Vijay Janapa Reddi, and Kim Hazelwood. Pin: Building customized program analysis tools with dynamic instrumentation. In Proceedings of the 2005 ACM SIGPLAN Conference on Programming Language Design and Implementation, PLDI '05, pages 190–200, New York, NY, USA, 2005. ACM.
[26] C. McCurdy and J. Vetter. Memphis: Finding and fixing NUMA-related performance problems on multi-core platforms. Pages 87–96, March 2010.
[27] Nicholas Nethercote and Julian Seward. Valgrind: A framework for heavyweight dynamic binary instrumentation. In Proceedings of the 28th ACM SIGPLAN Conference on Programming Language Design and Implementation, PLDI '07, pages 89–100, New York, NY, USA, 2007. Association for Computing Machinery.
[28] Ivy Bo Peng, Roberto Gioiosa, Gokcen Kestor, Pietro Cicotti, Erwin Laure, and Stefano Markidis. RTHMS: A tool for data placement on hybrid memory system. In Proceedings of the 2017 ACM SIGPLAN International Symposium on Memory Management, ISMM 2017, pages 82–91, New York, NY, USA, 2017. Association for Computing Machinery.
[29] Ashay Rane and James Browne. Enhancing performance optimization of multicore chips and multichip nodes with data structure metrics. In Proceedings of the 21st International Conference on Parallel Architectures and Compilation Techniques, PACT '12, pages 147–156, New York, NY, USA, 2012. ACM.
[30] Sebastien Valat and Othman Bouizi. Numaprof, a NUMA memory profiler. In Euro-Par 2018: Parallel Processing Workshops, Lecture Notes in Computer Science, vol. 11339, pages 159–170. Springer, Cham, December 2018.
[31] Konstantin Serebryany, Derek Bruening, Alexander Potapenko, and Dmitry Vyukov. AddressSanitizer: A fast address sanity checker. In Proceedings of the 2012 USENIX Annual Technical Conference, USENIX ATC'12, pages 28–28, Berkeley, CA, USA, 2012. USENIX Association.
[32] François Trahay, Manuel Selva, Lionel Morel, and Kevin Marquet. NumaMMA: NUMA memory analyzer. In Proceedings of the 47th International Conference on Parallel Processing, ICPP 2018, New York, NY, USA, 2018. Association for Computing Machinery.
[33] R. Yang, J. Antony, A. Rendell, D. Robson, and P. Strazdins. Profiling directed NUMA optimization on Linux systems: A case study of the Gaussian computational chemistry code. Pages 1046–1057, May 2011.
[34] Qin Zhao, David Koh, Syed Raza, Derek Bruening, Weng-Fai Wong, and Saman Amarasinghe. Dynamic cache contention detection in multi-threaded applications. In The International Conference on Virtual Execution Environments, Newport Beach, CA, March 2011.
[35] L. Zhu, H. Jin, and X. Liao. A tool to detect performance problems of multi-threaded programs on NUMA systems. In 2016 IEEE Trustcom/BigDataSE/ISPA.