GAPP: A Fast Profiler for Detecting Serialization Bottlenecks in Parallel Linux Applications
Reena Nair∗ and Tony Field∗
[email protected], [email protected]
Imperial College London, London

ABSTRACT
We present a parallel profiling tool, GAPP, that identifies serialization bottlenecks in parallel Linux applications arising from load imbalance or contention for shared resources. It works by tracing kernel context switch events using kernel probes managed by the extended Berkeley Packet Filter (eBPF) framework. The overhead is thus extremely low (an average 4% runtime overhead for the applications explored), the tool requires no program instrumentation and works for a variety of serialization bottlenecks. We evaluate GAPP using the Parsec 3.0 benchmark suite and two large open-source projects: MySQL and Nektar++ (a spectral/hp element framework). We show that GAPP is able to reveal a wide range of bottleneck-related performance issues, for example arising from synchronization primitives, busy-wait loops, memory operations, thread imbalance and resource contention.
CCS CONCEPTS
• General and reference → Performance; Measurement.
KEYWORDS
Bottlenecks, Multithreaded, Parallel, Profiler, Kernel Tracing, eBPF, Context-switch
ACM Reference Format:
Reena Nair and Tony Field. 2020. GAPP: A Fast Profiler for Detecting Serialization Bottlenecks in Parallel Linux Applications. In Proceedings of the 2020 ACM/SPEC International Conference on Performance Engineering (ICPE '20), April 20–24, 2020, Edmonton, AB, Canada. ACM, New York, NY, USA, 8 pages. https://doi.org/10.1145/3358960.3379136
∗Both authors contributed equally to this research.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].
ICPE '20, April 20–24, 2020, Edmonton, AB, Canada
© 2020 Copyright held by the owner/author(s). Publication rights licensed to ACM.
ACM ISBN 978-1-4503-6991-6/20/04...$15.00
https://doi.org/10.1145/3358960.3379136

1 INTRODUCTION
A key challenge for multi-core program developers is identifying and fixing performance problems, and this is exacerbated in multi-core systems because of competition for shared resources. Such resources may be physical, e.g. a hardware accelerator which can only be accessed by one thread of computation at a time, or logical, e.g. a lock protecting access to parts of, or the entirety of, a shared data structure. The speed-up achievable in such cases is thus limited by the inherent serialization that occurs, as governed by Amdahl's law.

The resource(s) that experience the greatest contention are called the bottleneck resources, or just bottlenecks, akin to the terminology used in queuing theory. Such resources have a queue of service requests associated with them and the immediate symptom of a bottleneck is excessive queuing at one or more of those resources. The key insight from queuing theory is that performance can only be improved significantly by "fixing the bottleneck", i.e. by reducing the load placed on the resource, e.g. by accessing it less often or holding it for less time, on average. Moreover, when there are multiple bottlenecks in a system, we can rank them based on a metric that takes into account how long the bottleneck is active and the length of the queue.

This paper presents a new bottleneck detection tool, called GAPP (Generic Automatic Parallel Profiler), which automatically pinpoints the line(s) of code that represent bottlenecks in a parallel Linux application. In contrast to some other bottleneck detection systems which work by instrumenting specific languages, libraries or synchronization primitives [11, 25–27, 29], GAPP uses lightweight instrumentation at the kernel level. Furthermore, all bottleneck analysis in GAPP is performed at run time, which avoids the need to generate, and subsequently process, potentially expensive trace files.
Indeed, a key objective in GAPP's design has been to minimize overhead, both during program execution and in post processing.

The idea is to use the extended Berkeley Packet Filter (eBPF) framework [9] (eBPF has been a standard feature of Linux since v4.1) to trace context-switching events inside the Linux kernel and keep track of the number of active threads at all times during a program's execution. We then identify the bottleneck thread(s) by taking account of both the duration of the thread's time-slices and the degree of parallelism exhibited whilst each is executing. We use a weighting algorithm similar to that described in [14, 20] and later also in [22] in order to do this. The information captured at context switching events is augmented by information from a lightweight sampling-based profiler that identifies the program counter location(s) that correspond to the bottleneck in the code. This is also implemented using eBPF. If the program under test is compiled to allow stack tracing, we are then able to use the information gathered to pinpoint the line(s) of source code that constitute the bottleneck and which should therefore be the target for optimization.

Figure 1: Switching intervals
We make the following contributions:
• We present GAPP, a tool to identify arbitrary serialization bottlenecks in parallel Linux applications based on a weighted criticality metric (CMetric) (Section 2). The tool has very low overhead, requires no instrumentation and works for a variety of bottlenecks where there is reduced parallelism.
• We show how eBPF can be used to build a lightweight sample-based profiler that, when used in conjunction with the bottleneck detector, is able to point the programmer at the line(s) of source code where the bottleneck is located (Section 4.3).
• We evaluate the effectiveness of GAPP using MySQL and Nektar++ [7] and benchmarks from the Parsec 3.0 suite (Section 5). We confirm known bottlenecks in Parsec 3.0 and also expose and discuss new, previously unreported bottlenecks. Performance experiments show that GAPP introduces an average overhead of circa 4% (maximum circa 13%) for the applications considered.
Existing parallel profilers identify bottlenecks by ranking code regions in terms of their share of parallelism [22], optimization opportunity [10] and asymptotic parallelism, i.e. the ratio of total work to critical work [27]. GAPP identifies bottlenecks by ranking thread execution slices in terms of their execution time, weighted by the number of active threads. We use the term Criticality Metric (CMetric) for this weighted metric, borrowing the terminology from [14]. The advantage of this metric is that it distinguishes threads that execute for short periods with low parallelism from those that execute for longer periods with a similar low degree of parallelism.
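To make the weighting concrete, the following sketch computes the CMetric for two hypothetical threads; the interval lengths and thread counts are invented for illustration and are not measurements from this paper:

```python
# Sketch: the CMetric weights each switching interval's length T_i by
# 1/n_i, where n_i is the number of active threads in that interval.
# All numbers below are hypothetical.

def cmetric(intervals):
    """CMetric of one timeslice: sum of T_i / n_i over its intervals."""
    return sum(t / n for t, n in intervals)

# Thread A: 10 ms alone (n=1), then 10 ms alongside three others (n=4).
a = cmetric([(10.0, 1), (10.0, 4)])   # 10.0 + 2.5 = 12.5

# Thread B: 20 ms with full parallelism (n=4).
b = cmetric([(20.0, 4)])              # 5.0

print(a, b)
```

Although both threads run for 20 ms in total, thread A's serial phase gives it a much larger CMetric than thread B, which is exactly the distinction the metric is designed to make.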
CMetric

The CMetric is calculated at the end of each execution timeslice. Timeslices are further divided into switching intervals demarcated by instants when any thread in the application changes state from the active to inactive state, or vice versa, as it is only this action that can change the degree of parallelism (threads are considered to be active in TASK_RUNNING state and inactive otherwise). Figure 1 shows an example trace of a multithreaded application with four threads. The switching events are labelled E_i and the switching intervals T_i, where interval T_i lies between events E_i and E_{i+1}. A single timeslice of a thread may therefore span several consecutive switching intervals.

The CMetric for every active thread is updated at the end of each switching interval. For example, in an interval T_i of Figure 1 during which two threads are active, the CMetric contribution for that interval is T_i/2, which is added to the individual CMetric of both active threads. The CMetric for a timeslice of execution is the sum of the contributions from each switching interval that occurred during the timeslice. As an example, if a thread's timeslice spans the switching intervals j, j+1, ..., k then the CMetric at the end of its timeslice will be ∑_{i=j}^{k} T_i/n_i, where n_i is the number of application threads that are active during interval i.

In [14] the idea was to use special-purpose hardware to keep track of the criticality metric with low cost. Here, we seek to achieve a similar effect in software, by exploiting existing mechanisms for tracing, and subsequently filtering, context-switch events in the Linux kernel.

Calculating the
CMetric requires tracking the duration and the number of active threads for each switching interval. In order to avoid instrumenting thread library primitives, and to capture scheduling activities triggered by other events, for example I/O, we keep track of active threads and their execution duration by tracing context-switch events using the eBPF framework, which provides a fast and secure mechanism to selectively trace kernel events [17, 23].

Kernel events are traced by eBPF using place holders in the kernel, called tracepoints, which, when enabled, can be monitored by attaching user-defined probe functions to them [13]. GAPP attaches a probe function to such a tracepoint, sched_switch, which is triggered at each context switch. This probe function keeps track of the number of active threads and also calculates the CMetric for each switching interval, as specified above.

Information about context-switches is maintained and shared with user space using eBPF maps. Maps can be global (shared between all cores) or per-cpu (local to a core). The GAPP architecture is shown in Figure 2.

In order to compute the CMetric for the application threads, GAPP's probe functions set up the following eBPF maps:
Map name      Description
cm_hash       Global hash map to store the CMetric of each thread
global_cm     Global scalar: cumulative sum of CMetrics across all switching events
local_cm      Local scalar: records the value of global_cm when a thread switches in
thread_count  Global scalar: keeps track of the number of active threads at any time
total_count   Global scalar: stores the total number of threads in the application
thread_list   Global hash: records, for each thread, whether it is active or inactive
t_switch      Local scalar: stores the timestamp of the most recent switching event

Table 1: eBPF Maps for calculating CMetric
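As a user-space sketch of how these maps cooperate (the real bookkeeping runs inside eBPF probe functions; the thread ids and timestamps below are hypothetical):

```python
# Simulation of the Table 1 bookkeeping. global_cm accumulates
# sum(T_i / n_i) over all switching intervals; each thread's CMetric is
# the growth of global_cm while that thread was switched in.

cm_hash = {}       # per-thread CMetric (cm_hash map)
local_cm = {}      # value of global_cm when each thread switched in
global_cm = 0.0    # cumulative weighted interval lengths
t_switch = 0.0     # timestamp of the most recent switching event

def switch_event(t, prev_pid, next_pid, n_active):
    """One context switch at time t; n_active threads ran since t_switch."""
    global global_cm, t_switch
    if n_active > 0:
        global_cm += (t - t_switch) / n_active   # close interval T_i
    t_switch = t
    if prev_pid is not None:                     # credit outgoing thread
        delta = global_cm - local_cm.get(prev_pid, global_cm)
        cm_hash[prev_pid] = cm_hash.get(prev_pid, 0.0) + delta
    if next_pid is not None:                     # snapshot for incoming thread
        local_cm[next_pid] = global_cm

# Thread 1 switched in at t=0; switched out at t=10 with 2 threads active.
switch_event(0.0, None, 1, 0)
switch_event(10.0, 1, None, 2)
print(cm_hash[1])  # 10/2 = 5.0
```

The same global_cm/local_cm subtraction trick, performed per thread at switch-out, is what lets the kernel probes compute each timeslice's CMetric without storing per-interval records.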
The sched_switch probe function needs to be executed only if either or both of the threads being switched in/out belong to the application. We therefore capture and store to thread_list the identifiers of the application tasks, by attaching additional probes to the task_rename and task_newtask tracepoints, which are invoked when new processes/threads are created. These also increment total_count as new tasks are created, and a probe attached to the sched_process_exit tracepoint decrements it when threads exit. Hence at any time, total_count represents the total number of threads in the application.

4.2 Maintaining the number of active threads
The thread_count is maintained in part by the probe function attached to the sched_switch tracepoint. The arguments to this tracepoint include a. prev_pid: the id of the thread being switched out and b. next_pid: the id of the thread being switched in. The thread_count is incremented whenever next_pid belongs to the application and was marked inactive in thread_list. Threads that were already in the running state do not alter the thread_count when they are switched in. Thread_count is decremented only when prev_pid is switched out to an inactive state.

When a waiting thread is woken up it will eventually be switched in, but there may be a delay between the wake-up and the context switch. During this time the thread is runnable, so it should be marked active in thread_list as soon as it is woken up. We therefore also trace wake-up events by attaching a probe function to the sched_wakeup tracepoint. This increments thread_count if the thread being woken up belongs to the application.
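The rules above can be condensed into a small state machine. The sketch below is a user-space simulation; in GAPP this logic lives in the eBPF probe functions, and the pids are hypothetical:

```python
# Sketch of thread_count maintenance from sched_switch / sched_wakeup.
# ACTIVE corresponds to TASK_RUNNING; INACTIVE to any waiting state.

ACTIVE, INACTIVE = 1, 0
thread_list = {}    # pid -> ACTIVE / INACTIVE (application threads only)
thread_count = 0    # current number of active application threads

def on_wakeup(pid):
    """sched_wakeup: the thread is runnable before it is switched in."""
    global thread_count
    if thread_list.get(pid) == INACTIVE:
        thread_list[pid] = ACTIVE
        thread_count += 1

def on_switch(prev_pid, prev_still_runnable, next_pid):
    """sched_switch: prev_pid is switched out, next_pid is switched in."""
    global thread_count
    if thread_list.get(next_pid) == INACTIVE:    # was not already runnable
        thread_list[next_pid] = ACTIVE
        thread_count += 1
    if thread_list.get(prev_pid) == ACTIVE and not prev_still_runnable:
        thread_list[prev_pid] = INACTIVE         # blocked: stop counting
        thread_count -= 1

# Thread 1 running, thread 2 blocked; 1 blocks while 2 is scheduled in,
# then 1 is woken up again.
thread_list.update({1: ACTIVE, 2: INACTIVE})
thread_count = 1
on_switch(prev_pid=1, prev_still_runnable=False, next_pid=2)
on_wakeup(1)
print(thread_count)  # 2
```

Note that a preempted thread (switched out while still runnable) keeps its ACTIVE mark, so only genuine blocking changes the count.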
Bottlenecks are identified by ranking execution timeslices in terms of their
CMetric, as detailed in Section 4.4 below. At each context switch event, global_cm is updated from the kernel probe thus: global_cm += (t - t_switch) / thread_count, where t is the current time and t_switch is the timestamp of the last switching event. Notice that t - t_switch corresponds to the length of the latest switching interval (T_i for some i in Figure 1). When an application thread is switched out, the cm_hash entry for that thread (identifier prev_pid) is then updated thus: cm_hash[prev_pid] += global_cm - local_cm. Note that if n_i is the number of active application threads in the i-th interval and the current timeslice spans the switching intervals between j and k inclusive, then the right-hand side above is equivalent to ∑_{i=j}^{k} T_i/n_i, as required. If the thread being switched in belongs to the application then we store the value of global_cm to prepare for the next update to cm_hash, viz. local_cm = global_cm.

At the end of a timeslice, we need to determine if it represents a potential bottleneck, and if so, we wish to pinpoint the line(s) of code that correspond to the bottleneck and also the call path that led to it by means of stack traces. The latter is important as it may be that a bottleneck appears only for a specific call path, even though other paths may lead to the same line of code.

A stack trace is recorded if the average level of parallelism during the timeslice is below a given target threshold, N_min, a tunable parameter, thus indicating the execution of a potential bottleneck at some stage during that timeslice. To do this we compute the weighted average of the number of active threads during a timeslice, threads_av, similar to the calculation of the CMetric detailed above. A stack trace is triggered if threads_av < N_min.

At the end of a timeslice, if a stack trace has been triggered, then the thread id, CMetric and stack trace are written to a circular buffer managed by eBPF (Figure 2). This buffer is readable from a
user-space probe that runs in parallel with the application threads. To avoid having to store every entry in deep stack traces we only record the top M (user-specified) entries in each.

Figure 2: GAPP Architecture

The above scheme works well if threads happen to be executing bottleneck code when a context switch happens. However, in many cases bottleneck code can be expected to complete its execution before the next context-switch event, i.e. before the next stack tracing opportunity. Because of this we have found that stack traces taken at context switches alone are often not enough to pinpoint the root cause of a bottleneck.

To address this problem, we use eBPF to add an additional lightweight periodic sampling probe, with period ∆t (another tunable parameter), which additionally records the instruction pointer whenever the absolute number of active threads, as given by thread_count, is less than N_min. If these conditions are met then the thread id and instruction pointer are written to the same eBPF circular buffer referred to above.

The user-space probe, written using the
BPF Compiler Collection (bcc) [17] toolkit, runs in parallel with the application threads and communicates with the kernel probes via the circular buffer. Instruction pointers from the sampling probe are read and assembled in a hash map that is indexed by the thread id. When a CMetric and stack trace associated with the same thread id are read from the buffer, the accumulated information is used to populate three locally-managed hash maps, indexed by a unique id, ts_id, associated with the timeslice. These hash maps contain the CMetric, the call path corresponding to the stack trace, and a list of addresses which include those from the sampling probe. These addresses are the candidates for bottleneck lines of code. If a stack trace is not triggered at the end of a timeslice then a special entry is written to the circular buffer, along with the current thread id, which instructs the user probe to reject any instruction pointers from sample probes associated with that thread.

When the program terminates, the user probe enters a post-processing phase which seeks to determine the instruction pointers that have the highest collective CMetric. At this point there will, in general, be a number of identical call paths whose ts_id indices are included in the list of CMetric entries. If i and j are two such indices, then the corresponding entries in the address lists and CMetric map are merged by: a. summing the CMetric values at indices i and j; b. combining the addresses at indices i and j to generate a frequency table for those addresses.

At the end of the merge process, we are thus left with two hash maps, each indexed by a unique call path: one with the accumulated CMetric for the call path and the other with the list of sampled addresses for the same call path. The entries with the top N total CMetrics are then taken as the bottlenecks. The reason why the top N entries are chosen instead of the top one alone is that one call path could be a subset of another, as would occur if a context switch happened during execution of one function that indirectly calls another that contains a bottleneck.

Finally, the items in the address map(s) are mapped to function names and the lines of code associated with them by calling the Linux addr2line utility. The final profile is presented as a frequency table of functions and lines of code, as illustrated in Figure 7.

Critical timeslices with no samples.
Sometimes, a timeslice may be identified as critical, but the sampler may fail to gather the instruction pointer(s) that define the location of the bottleneck. This can happen for two reasons: either the timeslice was too short, so that the sampler missed it, or the sample belonged to a shared library or kernel code and hence could not be mapped to the source code. In the absence of any samples, the next best thing is to add the address at the top of the stack, i.e. the return address of the caller, to the list of samples. In summary, the top stack address is attached to the samples if a. the sample count is zero and b. the active thread count is less than or equal to N_min when the thread is switched out. Such samples are labelled as being from the stack top to help the user interpret results correctly.

We now detail a series of experiments that aim to: a) evaluate GAPP's ability to pinpoint bottlenecks in parallel applications and b) quantify the overhead of the tool for various benchmarks.
We evaluate GAPP using applications from the Parsec 3.0 benchmark suite [6] and two larger open-source projects: MySQL and Nektar++ [7], a spectral/hp element framework for solving partial differential equations. Experiments were performed on a server machine with four AMD Opteron 6282SE eight-core/16-thread CPUs (with a total capability to run 64 threads in parallel) and 128GB of RAM. The system runs Linux 4.15 with the bcc toolkit installed. The applications were compiled with gcc version 7.3.0 at the -O3 optimization level. The compiler options -g and -fno-omit-frame-pointer were enabled to aid effective call stack retrieval. Recent versions of gcc generate position-independent executables by default; to map addresses to source code using addr2line, this behaviour needs to be overridden through the compiler and linker options -fno-pie and -no-pie respectively.

In the initial set of experiments we set N_min as a fixed fraction of the number of application threads n, and used a fixed sampling period ∆t. The sensitivity of the framework to these parameters is evaluated in the README file of GAPP's Github repository [1].
In this section, we analyze the results of profiling different parallel applications with GAPP.
Parsec 3.0 benchmark
We evaluated GAPP with 11 multi-threaded applications from the Parsec 3.0 [6] benchmark suite. The applications were executed with 64 threads on the native input set, as we were using a 64-core machine, although the framework does not restrict the number of threads in any way. All the applications except Freqmine were compiled with the Pthreads library; Freqmine uses the OpenMP threading library.

Out of the 11 applications used for evaluation, two, namely Dedup and Ferret, are task-parallel; the remaining are data-parallel. GAPP is able to identify not only previously established bottlenecks, but also several that have not previously been reported. We compare the results with those from previous studies and show that GAPP is able to detect, and pinpoint, serialization bottlenecks resulting from, e.g. synchronization primitives, workload imbalance, busy-wait loops and contention.

Since these applications have been extensively analyzed in previous studies, we only report results which add to the existing analysis. The results obtained for other applications are listed in Table 2, which summarizes the critical functions along with the original program execution time (T), the GAPP overhead as a percentage of the original time (O/H), the proportion of critical timeslices to the total timeslices as a percentage (CR), the memory usage of the tool (M) in MB and the post-processing time (PPT) in seconds.
Bodytrack.
Bodytrack is a vision application that tracks the 3Dpose of a marker-less human body with multiple cameras throughan image sequence. The application uses worker threads to processvideo frames as per the commands from the parent thread. GAPPidentified
OutputBMP() and
RecvCmd() as the top critical functions.In
RecvCmd() , the worker threads wait on a conditional variabletill they receive commands from the parent thread.
Figure 3: General logic of Bodytrack
The OutputBMP function, invoked by the parent thread, generates the camera image and saves it in BMP format. While the images are processed by OutputBMP, the worker threads are waiting on the conditional variable in the RecvCmd function. We commented out the OutputBMP function and profiled all threads with GAPP, which showed a 45% reduction in the number of samples from RecvCmd. This confirms that the
OutputBMP function is indeed the bottleneck.

Application     Critical Functions identified by GAPP    O/H    T (s)   CR               M (MB)  PPT (s)
Blackscholes    CNDF()                                   <1%    29.4    470 (2%)         109     0.02
Bodytrack       OutBMP, RecvCmd                          5%     21.3    6823 (0.5%)      112     0.4
Canneal         netlist_elem::swap_cost                  2.2%   62      267 (0.06%)      112     0.1
Dedup           deflate_slow                             12%    13      362544 (40%)     372     3.3
Facesim         Update_Position_Based_State_Helper       4.4%   59      334 (0.004%)     118     0.8
Ferret          dist_L2_float                            3%     30      42127 (51%)      132     1.2
Fluidanimate    parsec_barrier_wait                      2%     37      11512 (1%)       112     0.3
Freqmine        FPArray_scan2_DB                         2%     34      11721 (13%)      110     0.6
Streamcluster   parsec_barrier_wait, dist                5.6%   201     2246172 (10.6%)  784     2.5
Swaptions       HJM_SimPath_Forward_Blocking             1%     8.6     43 (0.07%)       111     0.2
Vips            imb_LabQ2Lab                             4%     14      10460 (3.2%)     118     2.3
MySQL           fil_flush, sync_array_reserve_cell       <1%    60      825 (0%)         194     3
Nektar++        dgemv_                                   9%     37      189990 (16%)     446     2.1

Table 2: Critical functions identified by GAPP, with overhead (O/H), execution time (T), number of critical timeslices with the proportion of critical timeslices to total timeslices (CR), memory usage in MB (M), and post-processing time (PPT).
We offloaded the OutputBMP function from the parent thread to a new thread called writerThread. Consequently, the parent passed commands to the worker threads at a faster pace, thereby reducing the workers' waiting time. The modification improved the performance of Bodytrack by 22%. This bottleneck has not been reported before. Figure 3 outlines the logic of Bodytrack, before and after optimization.
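The fix follows a standard producer-consumer pattern. The sketch below is a generic Python illustration of that pattern, not Bodytrack's actual C++ code; save_bmp is a hypothetical stand-in for OutputBMP:

```python
# Sketch of the Bodytrack modification: the parent hands frames to a
# dedicated writer thread via a queue instead of writing them inline,
# so it can keep issuing commands to the worker pool.
import queue
import threading

write_q = queue.Queue()
written = []                       # records what the writer produced

def save_bmp(frame):
    written.append(frame)          # stand-in for the slow BMP output

def writer_thread():
    while True:
        frame = write_q.get()
        if frame is None:          # sentinel: input exhausted
            break
        save_bmp(frame)

def parent_loop(frames):
    w = threading.Thread(target=writer_thread)
    w.start()
    for frame in frames:
        write_q.put(frame)         # delegate output to writerThread
        # ...parent immediately sends the next command to the workers...
    write_q.put(None)
    w.join()

parent_loop([0, 1, 2])
print(written)  # [0, 1, 2]
```

Because the queue decouples the parent from the slow output path, the parent's command-dispatch loop is no longer serialized behind each frame write.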
Ferret and Dedup. Ferret and Dedup are task-parallel applications designed with different pipeline stages, with a queue connecting each stage to the next one.

Ferret implements content-based similarity search on images, audio, 3D shapes etc., and is implemented with six pipeline stages. The first and last phases, which perform I/O, are serial phases. The middle four stages implement query image segmentation, feature extraction, indexing and ranking respectively. We executed Ferret with 15 threads in each of the parallel phases, making a total of 62 threads for the whole execution. The top critical functions identified by GAPP were invoked by emd() (Earth Mover's Distance), which finds the pair-wise distance between the query image and candidate images, and also forms the core of the ranking phase.
Figure 4: CMetric for different thread allocations - Ferret
As evident from Figure 4, Ferret exhibited a huge imbalance in CMetric among the threads (we have used a line chart to represent this discrete data in order to highlight the variation in CMetric among the phases, for a particular thread allocation). The threads with higher CMetric belonged to the ranking phase, as was evident from the critical functions identified by GAPP (see Table 2: Ferret). We redistributed the load among the parallel phases in Ferret until we obtained a uniform CMetric for the threads, as shown in Figure 4. Threads were reallocated as 2-1-18-39 among the parallel phases, which improved the run time of Ferret by 50%, almost double the speed-up achieved by the reallocation of 20-1-22-21 suggested by [10] (23% speed-up on the test bed).
Dedup is designed with five pipeline stages, viz., Fragment, FragmentRefine, Deduplicate, Compress and Reorder. The first and last stages perform I/O with a single thread, and the rest parallelize the task among a pool of worker threads. In our experiments, Dedup was configured to run with 20 threads in the intermediate stages to make a total of 62 threads for the application.

With an initial thread allocation of 1-20-20-20-1, the write_file() function from the Reorder phase, and the deflate_slow function from the Compress phase were identified as the top critical paths. The sequential phase, Reorder, which writes the compressed chunks to a file, is known to be a bottleneck [12]. To accelerate the Compress phase, we moved threads from the FragmentRefine and Deduplicate stages to the Compress stage, and ran the experiment with 1-16-16-28-1 threads in the respective stages. However, increasing parallelism in the Compress stage increased the run time, which indicates possible contention in that stage. We then decreased the number of threads in the Compress stage from 20 to 15 (thread allocation 1-20-20-15-1), and this improved the run time by 14%.

The workload imbalance among threads in Dedup and Ferret was reported in [26] for a 16-core machine, which also suggests an optimal thread allocation for the two applications. We have included it here to show how GAPP can be used to tune the load among threads in such cases.
We have also tested GAPP with two large applications, viz., Nektar++ and MySQL. Nektar++ is a multi-process application that uses OpenMPI, while MySQL is a multi-threaded database management system.
Figure 5: CMetric of individual processes - Nektar++

Nektar++.
We evaluate GAPP on an MPI application, Nektar++, which implements scalable PDE solvers using the spectral/hp element method [7]. We focused the evaluation on the Incompressible Navier-Stokes Solver (IncNSS), which was configured to run with 16 processes on a cylindrical surface. The solver divides the complex problem into several small elements represented as an unstructured mesh, partitions the mesh and assigns each partition to an individual process.

Message passing in Nektar++ is implemented with OpenMPI, and by default, MPI processes are configured to run in "aggressive" mode. In this mode, rather than blocking waiting for messages, processes spin on a busy-wait loop. Hence the initial profile generated by GAPP exhibited a uniform CMetric for all processes, as they are never idle. However, this behaviour masks any load imbalances in the application. To disable busy waiting, we recompiled Nektar++ with MPICH, an alternative implementation of the MPI standard, enabling the --with-device=ch3:sock option. This revealed a substantial non-uniformity in load, as shown in Figure 5.

A likely cause of such load imbalances is non-uniform partitioning. To test this we artificially created a structured mesh for a cuboid surface and uniformly partitioned it among eight processes. The use of a structured mesh makes it easier to generate uniform partitions. With this partitioning, processes exhibited negligible variation in CMetric, which confirms that the cause of the imbalance in the cylindrical solver was indeed non-uniform partitioning. Fixing this would involve re-engineering the mesh partitioner, which is beyond the scope of the current work.

As well as identifying the above load imbalance, GAPP pinpointed dgemv_(), a matrix-vector multiplication routine exported by the BLAS library, as the top critical function in IncNSS. We recompiled Nektar++ with OpenBLAS, an optimized version of the BLAS library. This improved the performance of the application by 27% and moved the bottleneck from dgemv_() to the Vmath::Dot2() function (Figure 6). Note that this led to negligible change in the observed load imbalance. Note also that dgemv_() doesn't represent a serialization bottleneck – it is simply an expensive function that happened to be executing with reduced parallelism.
MySQL. We profiled MySQL 5.7 with GAPP while executing the OLTP_Read_Write workload from the Sysbench benchmark. The top critical samples were from pfs_os_file_flush_func(), which flushes the write buffers of a given file to disk. The call path shows that this function was invoked by InnoDB, the transactional storage engine for MySQL (Figure 7a). To optimize InnoDB disk I/O, we increased the buffer pool size to 90GB (70% of the total system memory) as suggested in [3]. This improved the transaction rate (measured in transactions per second) by 19% and reduced the average latency by 16%.

Figure 6: Bottleneck functions in Nektar++

a. Critical Path 1:
fil_flush() [mysqld] <--- log_write_up_to() <--- trx_commit_complete_for_mysql() <--- innobase_commit() <--- ha_commit_low() <--- TC_LOG_DUMMY::commit() <--- ha_commit_trans() <--- trans_commit() <--- mysql_execute_command() <--- Prepared_statement::execute()

Functions and lines + frequency
-------------------------------
pfs_os_file_flush_func -- 1462
os0file.ic:507 (StackTop) -- 1462

b. Critical Path 2:
sync_array_reserve_cell() <--- rw_lock_s_lock_spin() <--- pfs_rw_lock_s_lock_func() <--- row_search_mvcc() <--- ha_innobase::index_read() <--- handler::ha_index_read_idx_map() <--- join_read_const_table() <--- JOIN::extract_func_dependent_tables() <--- JOIN::make_join_plan() <--- JOIN::optimize()

Functions and lines + frequency
-------------------------------
sync_array_reserve_cell() -- 469
sync0arr.cc:389 (StackTop) -- 469

Figure 7: Critical Paths - MySQL 5.7

The second most critical samples were from a spin-wait loop in sync_array_reserve_cell(), invoked from rw_lock_s_lock_spin, as shown in Figure 7b. Mutexes and read-write locks in MySQL are designed such that threads poll for them for a short period of time before they block. The duration of the spin-wait loop is dependent on a random number between 0 and
INNODB_SPIN_WAIT_DELAY (a configurableconstant). The default value of this constant is 6. [2] advises to setthe constant to a higher value to improve performance on machineswith large number of cores. We set the constant to a value 30, andthis cumulatively improved the transactions rate by 34% and re-duced the average latency by 25%. Moreover, the number of cachemisses decreased by 10.5%, when compared with the default delayvariable value. This shows better cache performance with increasedspin-wait delay. Interestingly, optimising the spin-wait delay with-out first optimising the buffer size made negligible difference to theoverall performance. This emphasises the importance of rankingthe bottlenecks by criticality and tackling each in turn.The bottlenecks detected in
MySQL were fixed by tuning configuration parameters rather than by refactoring the source code. Existing MySQL performance-tuning tools are designed to identify similar optimization opportunities. Nonetheless, the experiments show that GAPP, a generic tool, is capable of identifying similar architecture-dependent bottlenecks.
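For concreteness, the spin-wait tuning described above can be applied through the server configuration. A minimal sketch of a my.cnf fragment, assuming the runtime variable name innodb_spin_wait_delay (the lower-case form of the configurable constant referred to above as INNODB_SPIN_WAIT_DELAY):

```ini
# my.cnf: raise the InnoDB spin-wait delay ceiling from its default of 6.
# Threads then poll contended locks for a longer (randomised) interval
# before blocking, which helped on our many-core test machine.
[mysqld]
innodb_spin_wait_delay = 30
```

The variable is dynamic in MySQL 5.7, so the same change can also be trialled on a running server with SET GLOBAL innodb_spin_wait_delay = 30;.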
For the 13 applications evaluated, the maximum overhead incurred was only 13% (Table 2). The overhead a profiler induces on an application is important, as it can alter the application's normal behaviour. To minimize the effect of tracing on program behaviour, we collect call paths only at the end of a context switch, and only if the time slice exhibited reduced parallelism; the same policy applies to instruction pointers from the sampling probe. The overhead was found to depend on the ratio of critical time slices to total time slices, the depth of the stack traces and the number of distinct stack traces. This is because GAPP caches the address-to-symbol mapping, so mapping time is reduced when stack traces are identical. The memory overhead was found to be proportional to the number of samples and the number of critical time slices and stack traces.
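The address-to-symbol caching just described can be sketched as follows. This is a hedged illustration, not GAPP's actual code: resolve_uncached is a stand-in for an expensive resolver such as bcc's sym(), and identical frames across stack traces are resolved once and then served from the cache.

```python
# Sketch of address-to-symbol caching: identical stack frames are
# resolved once; repeated traces cost no further resolver calls.
from functools import lru_cache

RESOLVER_CALLS = {"count": 0}   # counts real (uncached) resolutions

def resolve_uncached(addr):
    """Placeholder for a symbol-table lookup (e.g. via bcc's sym())."""
    RESOLVER_CALLS["count"] += 1
    return "func_%#x" % addr    # a real resolver returns the symbol name

@lru_cache(maxsize=None)
def resolve(addr):
    return resolve_uncached(addr)

def symbolize(stack):
    """Map a raw stack trace (a list of addresses) to symbol names."""
    return [resolve(addr) for addr in stack]

# Two identical traces: the second incurs no further resolver calls,
# which is why mapping time falls when stack traces repeat.
trace = [0x4005D0, 0x4006F0, 0x400710]
first = symbolize(trace)
second = symbolize(trace)
```

This is why the observed overhead depends on the number of distinct stack traces rather than the total number of traces.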
Bottleneck detection in multithreaded applications is by no means new and several techniques have been proposed. Some identify parallelism bottlenecks based on critical sections [20, 29] or critical paths [8]; others target bottlenecks introduced by specific resource-access (synchronization) mechanisms, for example locks [11, 25], or by native threading libraries such as POSIX threads [26] and Intel TBB [27]. These strategies identify the execution regions where the resource is acquired and released by instrumenting the source code or binary. The major benefit of instrumentation is the ease of mapping execution traces to the source code. Information is generated only at instrumentation points, limiting the amount of data collected, which makes post-processing faster and easier. However, such tools can only pinpoint bottlenecks that arise from a particular class of resource, and they are bound to a particular language or library, which restricts their use.

Techniques that work independently of the language or library commonly rely on stack traces [5, 18, 30] or core dumps [4] to identify expensive call sequences or excessive idle time. They are designed to filter and analyze the large amounts of data generated during program execution. Yet another category of tools either categorizes bottlenecks based on hardware events [16, 19, 28] or limits its scope to identifying the critical thread [14, 15] alone.
Bottle graphs [15] and Criticality Stacks [14] identify the most critical thread by ranking threads based on execution time and the number of active threads. They were proposed as mechanisms to identify and accelerate critical threads for power/performance optimization.

GAPP's logic bears some similarity to the approaches proposed in [14, 15, 24] and [22]. Bottle graphs [15] tracks active threads using kernel modules that intercept futex calls and the system calls that create, destroy and schedule threads. Criticality Stacks [14] proposes a hardware approach to calculate a criticality metric and identify the critical threads, which are diverted to faster cores for performance and energy optimization. While both are capable of identifying critical threads, neither provides information about the code section or function that causes the bottleneck. [22] ranks all basic blocks in a program based on their share of parallel execution; execution information is gathered using Parallel Block Vectors [21], which uses an LLVM compiler pass to instrument the Pthread library routines in the application. [24] quantifies insufficient parallelism by attaching an idleness metric to each call path: it samples a time-based counter, and a signal handler collects the calling contexts if the application thread is active during the sample.

While GAPP's logic is similar, it requires no hardware modification or software instrumentation, produces accurate results because every scheduling event is captured, and achieves the same goal with very low overhead, as calling contexts are gathered only when critical. Moreover, GAPP works with an arbitrary number of threads and when other applications are running concurrently. This is because GAPP uses the thread state to determine whether a thread is active, whereas the related approaches in [14] and [24] consider a thread to be active only if it is occupying a CPU core; that calculation of the degree of parallelism can go wrong when other applications run concurrently or when the number of application threads exceeds the number of CPUs. [22] calculates the degree of parallelism by instrumenting the thread creation/exit/synchronization primitives in the pthread library; hence, a thread that blocks for some reason other than synchronization will not be captured by the framework.

GAPP's approach is complementary to a recent study on off-CPU analysis [31], which identifies waiting events critical to throughput. Information about waiting events is recorded by tracking interrupts and thread-switching events using kprobes. This information is used to build a wait-for graph, which is post-processed to determine the threads influenced by a waiting event. Even though GAPP does not explicitly state the waiting relationships among tasks, we have observed that the same information can be interpreted from the stack traces generated during context switches.
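The earlier distinction between state-based and occupancy-based notions of thread activity can be sketched with a hypothetical scheduler snapshot. This is purely illustrative, not GAPP's code; the tuple layout and thread counts are invented for the example.

```python
# Counting "active" threads by scheduler state versus by CPU occupancy.
# Hypothetical snapshot of six threads on four cores: (tid, state, on_cpu).
NUM_CPUS = 4

snapshot = [
    (1, "runnable", True),
    (2, "runnable", True),
    (3, "runnable", True),
    (4, "runnable", True),
    (5, "runnable", False),  # ready to run, but no free core
    (6, "blocked", False),   # waiting on a lock: genuinely inactive
]

# State-based count (GAPP's notion): any runnable thread is active.
active_by_state = sum(1 for _, state, _ in snapshot if state == "runnable")

# Occupancy-based count: only threads currently on a core are active.
active_by_occupancy = sum(1 for _, _, on_cpu in snapshot if on_cpu)
```

With more threads than cores, the occupancy-based count saturates at NUM_CPUS (4) and under-reports the true degree of parallelism (5), which is exactly the failure mode noted above when threads outnumber CPUs or other applications compete for cores.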
wPerf can provide more detailed information than GAPP through its wait-for graphs, but the trade-off is that it requires substantially longer post-processing times, e.g. 271.9 s for MySQL as quoted in [31]. The reported application execution time overheads are broadly similar to GAPP's.

GAPP's results are consistent across multiple runs, and they can be obtained in a single execution. With tools such as Coz [10], we have observed significant variation across experiments on an 8-core machine. This most likely happens because Coz uses a statistical approach, based on sample counts, to determine the delay inserted into threads, and we have found that this can make it difficult to reproduce results from one run to the next. Also, [10] does not report call paths, which can be important when the same code can be invoked from multiple paths. Moreover, none of these tools has been tested on parallel applications that use MPI or similar message-passing constructs.
Locks or synchronization primitives with low contention may not be identified by GAPP. Even though these are not strict serialization bottlenecks, it has been shown that using the proper primitive in such cases can improve performance [26], and hence they represent a possible optimization opportunity.

GAPP may not identify spin locks that spin indefinitely. This problem is shared by frameworks that rely on instrumentation, as spin locks can be implemented using custom-built primitives that instrumentation cannot detect. As with spin locks, GAPP will not identify bottlenecks in applications that busy-wait, such as MPI applications executing in aggressive mode. However, in our experience, disabling aggressive mode, at least during development, helps identify existing load imbalance among the participating processes.

The default behaviour of gcc, which is to generate position-independent executables, needs to be overridden for the addr2line utility to work properly. This could be overcome by finding a way to map the offset provided by the sym() primitive of bcc to the source code. Also, adding eBPF probes to the kernel and removing them requires root privileges, so GAPP must be run as root.
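To make the root-privilege requirement concrete, the following is a minimal bcc sketch of attaching a kprobe at a context switch, the kind of probe GAPP relies on. The map and function names here are illustrative, not GAPP's own; loading the program requires the bcc package and root privileges.

```python
# Minimal sketch of a context-switch kprobe in bcc. Only the BPF program
# text and the attach step are shown; loading it needs root.

BPF_TEXT = r"""
#include <uapi/linux/ptrace.h>

BPF_HASH(switch_count, u32, u64);

/* Fires on every context switch; counts switches per thread id. */
int on_switch(struct pt_regs *ctx) {
    u32 tid = bpf_get_current_pid_tgid();
    u64 zero = 0;
    u64 *count = switch_count.lookup_or_try_init(&tid, &zero);
    if (count)
        (*count)++;
    return 0;
}
"""

def attach(bpf):
    """Attach the probe to the scheduler's finish_task_switch symbol."""
    bpf.attach_kprobe(event="finish_task_switch", fn_name="on_switch")

# Usage (as root):
#   from bcc import BPF
#   b = BPF(text=BPF_TEXT)
#   attach(b)
#   ... periodically read b["switch_count"] ...
```

The eBPF verifier checks the program before it is loaded, which is why such probes are safe to attach to a live kernel, but the bpf() system call and kprobe attachment themselves are privileged operations.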
GAPP works “out of the box” in that it can be used to profile an application without the need to instrument the source code, patch the operating system or undertake elaborate post-processing of data captured during a program's execution. Moreover, the verifier in the eBPF framework ensures that the probes are safe to attach to a live kernel.

GAPP is able to identify a range of different bottlenecks, for example critical sections, execution hot-spots, bottlenecks identified from hardware events, busy-wait loops and resource contention. GAPP has proven to be remarkably good at detecting bottlenecks, but more work is required to classify the bottleneck type, e.g. as being due to synchronization or I/O. To automate bottleneck classification we have recently experimented with tracking I/O system calls, with a view to determining which files, IP addresses etc. an application interacts with. We have also explored tracing kernel-level synchronization (“futex”) calls, which are used by many higher-level libraries and macros. For example, by combining GAPP's existing criticality information with an analysis of futex ‘wakers’ it is relatively easy to distinguish critical from non-critical lock holders; this can help to rank multiple call paths leading to the same lock. By accessing other hardware counters it is possible, in principle, to trace other sources of bottleneck such as page faults.

It would be good to see how GAPP behaves with other parallel platforms such as Java, Intel TBB, Cilk etc. While the core concepts in GAPP are not tied to any particular language or library, we need to experiment to see how stack traces and address mapping work in the presence of the virtual environments or intermediate schedulers found in such platforms.
REFERENCES

[1] 2019. GAPP Development Repository. https://github.com/RN-dev-repo/GAPP
[2] 2019. MySQL 5.7 Reference Manual. Retrieved 2019-09-15 from https://docs.oracle.com/cd/E17952_01/mysql-5.7-en/innodb-performance-spin_lock_polling.html
[3] 2019. MySQL 5.7 Reference Manual: 8.5.8 Optimizing InnoDB Disk I/O. https://dev.mysql.com/doc/refman/5.7/en/optimizing-innodb-diskio.html
[4] Erik Altman, Matthew Arnold, Stephen Fink, and Nick Mitchell. 2010. Performance analysis of idle programs. In ACM SIGPLAN Notices, Vol. 45. ACM, 739–753.
[5] Glenn Ammons, Jong-Deok Choi, Manish Gupta, and Nikhil Swamy. 2004. Finding and removing performance bottlenecks in large systems. In European Conference on Object-Oriented Programming. Springer, 172–196.
[6] Christian Bienia, Sanjeev Kumar, Jaswinder Pal Singh, and Kai Li. 2008. The PARSEC benchmark suite: Characterization and architectural implications. In Proceedings of the 17th International Conference on Parallel Architectures and Compilation Techniques. ACM, 72–81.
[7] Chris D. Cantwell, David Moxey, Andrew Comerford, Alessandro Bolis, Gabriele Rocco, Gianmarco Mengaldo, Daniele De Grazia, Sergey Yakovlev, J-E. Lombard, Dirk Ekelschot, et al. 2015. Nektar++: An open-source spectral/hp element framework. Computer Physics Communications 192 (2015), 205–219.
[8] Guancheng Chen and Per Stenstrom. 2012. Critical lock analysis: Diagnosing critical section bottlenecks in multithreaded applications. In Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis. IEEE Computer Society Press, 71.
[9] Jonathan Corbet. 2014. The BPF system call API, version 14. https://lwn.net/Articles/612878/
[10] Charlie Curtsinger and Emery D. Berger. 2015. Coz: Finding Code that Counts with Causal Profiling. In Proceedings of the 25th Symposium on Operating Systems Principles (SOSP '15). ACM, 184–197. https://doi.org/10.1145/2815400.2815409
[11] Florian David, Gael Thomas, Julia Lawall, and Gilles Muller. 2014. Continuously measuring critical section pressure with the free-lunch profiler. In ACM SIGPLAN Notices, Vol. 49. ACM, 291–307.
[12] Daniele De Sensi, Tiziano De Matteis, Massimo Torquati, Gabriele Mencagli, and Marco Danelutto. 2017. Bringing Parallel Patterns Out of the Corner: The P3ARSEC Benchmark Suite. ACM Trans. Archit. Code Optim. 14, 4, Article 33 (Oct. 2017), 26 pages. https://doi.org/10.1145/3132710
[13] Mathieu Desnoyers. 2018. Using the Linux Kernel Tracepoints. Online.
[14] Kristof Du Bois, Stijn Eyerman, Jennifer B. Sartor, and Lieven Eeckhout. 2013. Criticality stacks: Identifying critical threads in parallel programs using synchronization behavior. In ACM SIGARCH Computer Architecture News, Vol. 41. ACM, 511–522.
[15] Kristof Du Bois, Jennifer B. Sartor, Stijn Eyerman, and Lieven Eeckhout. 2013. Bottle graphs: Visualizing scalability bottlenecks in multi-threaded applications. In ACM SIGPLAN Notices, Vol. 48. ACM, 355–372.
[16] Stijn Eyerman, Kristof Du Bois, and Lieven Eeckhout. 2012. Speedup stacks: Identifying scaling bottlenecks in multi-threaded applications. In 2012 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS). IEEE, 145–155.
[17] Matt Fleming. 2017. A thorough introduction to eBPF. https://lwn.net/Articles/740157/
[18] Shi Han, Yingnong Dang, Song Ge, Dongmei Zhang, and Tao Xie. 2012. Performance debugging in the large via mining millions of stack traces. In Proceedings of the 34th International Conference on Software Engineering. IEEE Press, 145–155.
[19] W. Heirman, T. E. Carlson, S. Che, K. Skadron, and L. Eeckhout. 2011. Using cycle stacks to understand scaling bottlenecks in multi-threaded workloads. In 2011 IEEE International Symposium on Workload Characterization (IISWC). 38–49. https://doi.org/10.1109/IISWC.2011.6114195
[20] José A. Joao, M. Aater Suleman, Onur Mutlu, and Yale N. Patt. 2012. Bottleneck identification and scheduling in multithreaded applications. ACM SIGPLAN Notices 47, 4 (2012), 223–234.
[21] Melanie Kambadur, Kui Tang, and Martha A. Kim. 2013. Parallel Block Vectors: Collection, Analysis, and Uses. IEEE Micro 33, 3 (2013), 86–94.
[22] Melanie Kambadur, Kui Tang, and Martha A. Kim. 2014. Parashares: Finding the important basic blocks in multithreaded programs. In European Conference on Parallel Processing. Springer, 75–86.
[23] Suchakrapani Datt Sharma and Michel Dagenais. 2016. Enhanced Userspace and In-Kernel Trace Filtering for Production Systems. Journal of Computer Science and Technology 31, 6 (2016), 1161–1178.
[24] Nathan R. Tallent and John M. Mellor-Crummey. 2009. Effective performance measurement and analysis of multithreaded applications. In ACM SIGPLAN Notices, Vol. 44. ACM, 229–240.
[25] Nathan R. Tallent, John M. Mellor-Crummey, and Allan Porterfield. 2010. Analyzing Lock Contention in Multithreaded Applications. In Proceedings of the 15th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP '10). ACM, New York, NY, USA, 269–280. https://doi.org/10.1145/1693453.1693489
[26] Mohammad Mejbah ul Alam, Tongping Liu, Guangming Zeng, and Abdullah Muzahid. 2017. SyncPerf: Categorizing, detecting, and diagnosing synchronization performance bugs. In Proceedings of the Twelfth European Conference on Computer Systems (EuroSys '17). ACM, New York, NY, USA, 298–313.
[27] Adarsh Yoga and Santosh Nagarakatte. 2017. A fast causal profiler for task parallel programs. arXiv preprint arXiv:1705.01522 (2017).
[28] Wucherl Yoo. 2012. Automated performance characterization of applications using hardware monitoring events. University of Illinois at Urbana-Champaign.
[29] Tingting Yu and Michael Pradel. 2016. SyncProf: Detecting, localizing, and optimizing synchronization bottlenecks. In Proceedings of the 25th International Symposium on Software Testing and Analysis. ACM, 389–400.
[30] Xiao Yu, Shi Han, Dongmei Zhang, and Tao Xie. 2014. Comprehending Performance from Real-world Execution Traces: A Device-driver Case. In Proceedings of the 19th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS '14). ACM, New York, NY, USA, 193–206. https://doi.org/10.1145/2541940.2541968
[31] Fang Zhou, Yifan Gan, Sixiang Ma, and Yang Wang. 2018. wPerf: Generic Off-CPU analysis to identify bottleneck waiting events. In USENIX Symposium on Operating Systems Design and Implementation (OSDI 18).