ScalAna: Automating Scaling Loss Detection with Graph Analysis
Yuyang Jin, Haojie Wang, Teng Yu, Xiongchao Tang, Torsten Hoefler, Xu Liu, Jidong Zhai
*Tsinghua University, †ETH Zürich, ‡North Carolina State University
{jyy17, wang-hj18}@mails.tsinghua.edu.cn, [email protected], [email protected], [email protected], [email protected], [email protected]

Abstract—Scaling a parallel program to modern supercomputers is challenging due to inter-process communication, Amdahl's law, and resource contention. Performance analysis tools for finding such scaling bottlenecks are based on either profiling or tracing. Profiling incurs low overhead but does not capture the detailed dependencies needed for root-cause analysis. Tracing collects all information, at prohibitive overhead. In this work, we design ScalAna, which uses static analysis techniques to achieve the best of both worlds: it enables the analyzability of traces at a cost similar to profiling. ScalAna first leverages static compiler techniques to build a Program Structure Graph, which records the main computation and communication patterns as well as the program's control structures. At runtime, we adopt lightweight techniques to collect performance data according to the graph structure and generate a Program Performance Graph. With this graph, we propose a novel approach, called backtracking root cause detection, which can automatically and efficiently detect the root cause of scaling loss. We evaluate ScalAna with real applications. Results show that our approach can effectively locate the root cause of scaling loss for real applications and incurs only 1.73% overhead on average for up to 2,048 processes. We achieve up to 11.11% performance improvement by fixing the root causes detected by ScalAna on 2,048 processes.

Index Terms—Performance Analysis, Scalability Bottleneck, Root-Cause Detection, Static Analysis
I. INTRODUCTION
A decade after Dennard scaling ended and clock frequencies stalled, increasing core count remains the only option to boost computing power. Top-ranked supercomputers [1] already contain millions of processor cores, such as ORNL's Summit with 2,397,824 cores, LLNL's Sierra with 1,572,480 cores, and Sunway TaihuLight with 10,649,600 cores. This unprecedented growth over the last years has shifted the complexity to the developers of parallel programs, for whom scalability is now a main concern. Unfortunately, not all parallel programs have caught up with this trend, and many cannot efficiently use modern supercomputers, mostly due to their poor scalability [2], [3].

Scalability bottlenecks can have a multitude of reasons, ranging from issues with locking, serialization, congestion, and load imbalance to many more [4], [5]. They often manifest themselves in synchronization operations, and finding the exact root cause is hard. Yet, with the trend towards larger core counts continuing, scalability analysis of parallel programs is becoming one of the most important aspects of modern performance engineering. Our work squarely addresses this topic for large-scale parallel programs.

TABLE I: Qualitative performance and storage analysis of state-of-the-art tools and ScalAna running NPB-CG with CLASS C for 128 processes [6]

  Tools           Approaches        Time Overhead     Storage Cost
  Scalasca [7]    Tracing-based     25.3% (w/o I/O)   6.77 GB
  HPCToolkit [8]  Profiling-based   8.41%             11.45 MB
  ScalAna         Graph-based       3.53%             314 KB
Researchers have made great efforts in scalability bottleneck identification using three fundamental approaches: application profiling, tracing, and modeling.
Profiling-based approaches [9], [10], [11] collect statistical information at runtime with low overhead. Summarizing the data statistically loses important information such as the order of events, control flow, and possible dependence and delay paths. Thus, such approaches can only provide coarse insight into application bottlenecks, and substantial human effort is required to identify the root cause of scaling issues.
Tracing-based approaches [12], [13], [7], [14] capture performance data as time series, which allows tracking dependence and delay sequences to identify the root causes of scaling issues. Their major drawback is the often prohibitive storage and runtime overhead of the detailed data logging. Thus, tracing-based analysis often cannot be used for large-scale programs. For example, Table I shows the performance and storage overhead of NPB-CG running with 128 processes under tracing, profiling, and our approach (note that this is a single run for overhead comparison, not a typical use case for scalability bottleneck identification). Both profiling-based and tracing-based approaches can use sampling techniques to reduce overhead, but at a certain loss of accuracy.
Modeling-based approaches [15], [16], [17], [18], [19], [20], [21] can also be used to identify scalability bottlenecks with low runtime overhead. However, building accurate performance models often requires significant human effort and skill. Furthermore, establishing full performance models for a complex application with many input parameters requires many runs and is prohibitively expensive [22]. Thus, we conclude that identifying scalability bottlenecks for large-scale parallel programs remains an important open problem.
Fig. 1: Overview of ScalAna. Graph generation combines compile-time static program analysis (intra-/inter-procedural analysis of the source code, producing the Program Structure Graph and its contracted form) with runtime sampling-based profiling (embedding performance data and communication dependence into the Program Performance Graph). Scaling loss detection then performs problematic vertex detection and backtracking root cause detection on the PPG to report the identified root cause.
To accurately identify scalability problems with low effort and overhead, we consider the program structure during data profiling. ScalAna combines static program analysis with dynamic sampling-based profiling into a light-weight mechanism to automatically identify the root cause of scalability problems for large-scale parallel programs. We utilize an intra- and inter-procedural analysis of the source-code structure and record dynamic message matching at runtime to establish an efficient dependence graph of the overall execution. In particular, ScalAna is able to detect latent scaling issues in complex parallel programs where a delay propagates to other processes after several time steps through massive communication dependence. In summary, there are three main contributions in our work:

• We design a fine-grained Program Structure Graph (PSG) that represents a compressed form of all program dependence within and across parallel processes. We then generate a Program Performance Graph (PPG) for each execution by enhancing the PSG, combining static compile-time analysis with light-weight runtime profiling.

• Based on the PPG, we design a location-aware algorithm to detect problematic vertices with scaling issues. Combining inter-process dependence chains, we further propose a novel graph analysis algorithm, called backtracking root cause detection, to find their root cause in the source code.

• We implement a light-weight performance tool named ScalAna and evaluate it with real applications. Results show that ScalAna can effectively and automatically identify the root cause of scalability problems.

We evaluate ScalAna with both benchmarks and real applications. Experimental results show that our approach can identify the root cause of scalability problems for real applications more accurately and effectively compared with HPCToolkit [8] and Scalasca [7]. ScalAna incurs only 1.73% overhead on average for the evaluated programs with up to 2,048 processes. We achieve up to 11.11% performance improvement by fixing the root causes detected by ScalAna on 2,048 processes.

II. DESIGN OVERVIEW
One main innovation of ScalAna is to build a Program Structure Graph (PSG) at compile time and use it during runtime to minimize tracing overheads. (ScalAna is available at: https://github.com/thu-pacman/SCALANA.) The PSG captures the main computation and communication patterns that can be extracted statically from a parallel program. During the execution, ScalAna collects light-weight performance data as PSG vertex attributes, as well as communication dependence between different processes, and finally forms a Program Performance Graph (PPG). Another innovation of ScalAna is that we leverage the features of the generated PPG to locate problematic vertices and then use graph analysis to automatically identify the root cause of scaling issues in the source code.

In general, ScalAna consists of two main modules, graph generation and scaling loss detection. Figure 1 shows the high-level workflow of our system. Graph generation contains two phases, static program analysis and sampling-based profiling. Static program analysis is done at compile time, while sampling-based profiling is performed at runtime. We use the LLVM compiler [23] to automatically build a PSG. Each vertex of the program structure graph corresponds to a code snippet in the source code. Scaling loss detection is an offline module, which includes problematic vertex detection and root-cause analysis. We describe several key steps of these two modules below.

Graph Generation

• Program Structure Graph (PSG). The input of this module is the source code of a parallel program. Through an intra- and inter-procedural static analysis of the program, we get a preliminary Program Structure Graph (Section III-A).

• Graph Contraction. In this step, we remove unnecessary edges in the PSG and merge several small vertices into a larger vertex to reduce the scalability analysis overhead (Section III-A).

• Performance Data and Communication Dependence. To effectively detect scalability bottlenecks, we leverage sampling techniques to collect performance data for each vertex of the PSG and communication dependence data with different numbers of processes (Section III-B).

• Program Performance Graph (PPG). To analyze the interplay of computation and communication among different processes, we further generate a Program Performance Graph based on the per-process PSGs (Section III-C).

Scaling Loss Detection

• Problematic Vertex Detection. Based on the structure of the acquired PPG, we design a location-aware detection approach to identify all problematic vertices (Section IV-A).

• Backtracking Root Cause Detection. Starting from the identified problematic vertices, we propose a backtracking algorithm on top of the PPG and identify all paths covering problematic vertices, which helps locate the root cause of the scaling issues (Section IV-B).
Fig. 2: Motivating example with NPB-CG: (a) the NPB-CG code snippet, a sequence of loops interleaved with MPI_Sendrecv calls; (b) the partial PPG of NPB-CG with loop vertices, execution-order edges, and communication dependence edges across processes 0–7; (c) backtracking root cause detection.
We give an example to show how our approach is used to detect scaling loss. Figure 2 shows the code snippet of NPB-CG [6] and the partial PPG of NPB-CG generated by ScalAna (due to space limitations, we only draw the PPG for 8 processes). We manually inject a delay into one process, which causes a scaling loss on the Tianhe-2 system (49.4 seconds at 1,024 processes vs. 49.5 seconds at 2,048 processes). Tracing-based approaches like Scalasca [7] and Vampir [24] generate more than 250 GB of trace data. Due to the covert performance issue, mixed from data, control, and inter-process communication dependence, we observe that a traditional profiling-based tool like HPCToolkit [8] needs significant human effort to identify the accurate root cause in this case.

In ScalAna, we leverage both static and dynamic analysis to build a holistic PPG that records the program execution order and data flow as well as the inter-process communication transfer, as shown in Figure 2(b). With our detection algorithm, we first identify some problematic vertices in this graph; in Figure 2(b), these vertices are marked with red, blue, yellow, or green color. In general, a problematic vertex is a vertex with unusual performance relative to other vertices. Then we perform backtracking root cause detection on the PPG of NPB-CG, as shown in Figure 2(c). By traversing this graph backwards, we detect that the red vertex of the delayed process is the root cause, through a path of vertices that traverses different processes.

In summary, ScalAna is a programmer-oriented scalability analysis tool: it takes the source code of a parallel program as input, detects the root cause of scaling bottlenecks, and reports back to the programmer which lines of the source code cause the problems, to guide further optimization of the program.

III. GRAPH GENERATION
In this section, we describe in detail how we automatically build an appropriate representation that reflects the main computation and communication characteristics of a given parallel program. Our approach mainly relies on a static program analysis module. It also incorporates a sampling-based profiling module to handle input-dependent information.
A. Static Program Structure Graph Construction
In general, the static analysis module is in charge of building a per-process PSG, which can be regarded as a sketch of a parallel program. In a PSG, the vertices represent the main computation and communication components as well as the program control flow. The edges represent their execution order based on both data and control flow. We group the vertices into different types, including Branch, Loop, Function call, and Comp, among which Comp is a collection of computation instructions while the others are basic program structures. There are three main phases to building a PSG statically: intra-procedural analysis, inter-procedural analysis, and graph contraction. During the intra-procedural analysis, we first build a local PSG for each function. Then, through an inter-procedural algorithm, we acquire a complete PSG, which is further refined by graph contraction.
Intra-procedural Analysis
During the intra-procedural analysis phase, we build a local PSG for each procedure. The basic idea is that we traverse the control flow graph of the procedure at the level of the intermediate representation (IR) of the program, identify loops, branches, and function calls, and then connect these components based on their dependence to form a per-function local graph.
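To make this step concrete, the following C++ sketch shows how loops, conditional branches, and call sites can be identified on LLVM IR. This is a minimal illustration under our own assumptions, not ScalAna's actual pass; the LocalPSG container is a hypothetical placeholder for the graph under construction.

  // Minimal sketch: identify PSG building blocks on LLVM IR.
  // LocalPSG is a hypothetical container, not part of ScalAna or LLVM.
  #include "llvm/Analysis/LoopInfo.h"
  #include "llvm/IR/Function.h"
  #include "llvm/IR/Instructions.h"

  struct LocalPSG {                      // hypothetical graph container
    void addLoop(llvm::BasicBlock *header);
    void addBranch(llvm::BranchInst *BI);
    void addCall(llvm::CallInst *CI);
  };

  void buildLocalPSG(llvm::Function &F, llvm::LoopInfo &LI, LocalPSG &G) {
    for (llvm::Loop *L : LI)             // top-level loops -> Loop vertices
      G.addLoop(L->getHeader());
    for (llvm::BasicBlock &BB : F) {
      if (auto *BI = llvm::dyn_cast<llvm::BranchInst>(BB.getTerminator()))
        if (BI->isConditional())         // conditional branch -> Branch vertex
          G.addBranch(BI);
      for (llvm::Instruction &I : BB)
        if (auto *CI = llvm::dyn_cast<llvm::CallInst>(&I))
          G.addCall(CI);                 // call site -> Function-call vertex
    }
    // The remaining instructions are grouped into Comp vertices, and all
    // vertices are then connected along control/data flow (omitted here).
  }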
Inter-procedural Analysis
Inter-procedural analysis is tocombine all the local PSGs into a complete graph. We startby analyzing the program’s call graph (PCG), which containsall calling relationships between different functions. And thenwe perform a top-down traversal of the PCG from the main function and replace all user-defined functions with their localPSGs. For MPI function calls, we just keep them. For indirectfunction calls, we need to process them after collecting certainfunction call relationships at runtime. For recursive functioncalls, their edges are similar to the recursive call edges in thePCG, which means that a circle is formed in the PSG. After thestatic analysis, the runtime performance data will be attachedto these vertices with extra call-stack information for furtheranalysis.
PSG Contraction
The PSGs generated in the above steps are normally too large to be used efficiently for real applications, since we create a corresponding vertex for every loop and branch in the source code. However, the workload of some vertices is negligible, so collecting performance data for them only introduces overhead without benefit. To address this problem, ScalAna performs graph contraction to reduce the size of the generated PSG.

The rules of contraction affect the granularity of the graph and the representation of communication and computation characteristics. Considering that communication is normally the main scalability bottleneck for parallel programs, ScalAna preserves all MPI invocations and their related control structures. For computation vertices in the PSG, we merge consecutive vertices into a larger vertex. Specifically, for structures that do not include MPI invocations, we only preserve Loop vertices, because the computation produced by loop iterations may dominate performance. In addition, ScalAna allows a user-defined parameter, MaxLoopDepth, as a threshold to limit the depth of nested loops and keep the graph condensed.

  int main() {
    for (int i = 0; i < N; ++i) {     // Loop 1
      A[i] = rand();
      for (int j = 0; j < i; ++j)     // Loop 1.1
        sum += A[j];
      for (int k = 0; k < i; ++k)     // Loop 1.2
        product *= A[k];
      foo();
      MPI_Bcast(...);
    }
  }

  void foo() {
    if (myRank % 2 == 0)
      MPI_Send(...);
    else
      MPI_Recv(...);
  }

Fig. 3: An MPI program example
Fig. 4: Static Program Structure Graph generation: (a) the local PSGs generated by intra-procedural analysis; (b) the complete PSG generated by inter-procedural analysis; (c) the final contracted PSG.

For instance, Figure 3 shows a simple MPI program example with two functions. Figure 4(a) shows the local PSG of each function after the intra-procedural analysis. Figure 4(b) shows the complete PSG after the inter-procedural analysis. Figure 4(c) shows the contracted PSG after merging the sequential Loop 1.1 and Loop 1.2 when MaxLoopDepth is set to 1.
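The merging rule for computation vertices can be sketched as follows. The Vertex type is a hypothetical simplification of ScalAna's internal PSG node, shown only to illustrate how runs of consecutive Comp vertices collapse while MPI invocations and control structures survive; folding loops nested deeper than MaxLoopDepth would work in the same spirit.

  // Minimal sketch of Comp-vertex merging during graph contraction.
  // The Vertex type is a hypothetical stand-in for the real PSG node.
  #include <vector>

  enum class VType { Comp, Loop, Branch, Call, MPI };
  struct Vertex { VType type; /* source range, children, ... */ };

  std::vector<Vertex> contractComps(const std::vector<Vertex> &seq) {
    std::vector<Vertex> out;
    for (const Vertex &v : seq) {
      // Consecutive Comp vertices collapse into the preceding one;
      // MPI invocations and control structures are preserved.
      if (v.type == VType::Comp && !out.empty() &&
          out.back().type == VType::Comp)
        continue;
      out.push_back(v);
    }
    return out;
  }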
B. Sampling-Based Profiling

ScalAna is a hybrid approach. We design a sampling-based profiling module to annotate the PSG with profiling data and also refine it based on runtime information. The sampling-based profiling module includes performance profiling, inter-process dependence profiling, and indirect call analysis. Performance profiling collects runtime metrics and fills them into the vertices of the graph to handle input-dependent information; we use a sampling technique for this purpose (Section III-B1). Inter-process dependence profiling connects the per-process PSGs into a larger graph (the PPG) in a way that cannot be derived statically (Section III-B2).
1) Associate Vertices with Performance Data:
We collect performance data for each vertex of the PSG at runtime, which is essential for the further analysis of scaling issues. Unlike traditional coarse-grained profiling approaches, ScalAna collects performance data at the granularity of each PSG vertex. One main advantage is that we can combine the graph structure and performance data for more accurate performance analysis. Specifically, we associate each PSG vertex with a performance vector that records the execution time and key hardware performance data, such as the cache miss rate and branch miss count.

We use sampling techniques for performance profiling to collect metrics with very low overhead. We use PAPI [25] for sampling and hardware performance data collection; it interrupts the program at regular clock-cycle intervals and records the program call stack and related performance data. Based on the call stack information, we associate the performance data with the PSG vertex corresponding to the interruption point.
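As an illustration of this mechanism, the sketch below sets up cycle-based overflow sampling with PAPI. The psg_account_sample helper that maps a sample to a PSG vertex is a hypothetical placeholder, the sampling period is an assumed value, and error handling is omitted.

  // Minimal sketch: interrupt the program every SAMPLE_PERIOD cycles and
  // attribute the sample to a PSG vertex via the captured call stack.
  #include <papi.h>
  #include <execinfo.h>

  static const int SAMPLE_PERIOD = 10000000;   // one sample per 10M cycles

  extern void psg_account_sample(void *pc, void **stack, int depth); // hypothetical

  static void on_overflow(int event_set, void *pc,
                          long long overflow_vector, void *context) {
    void *stack[64];
    int depth = backtrace(stack, 64);          // record the call stack
    psg_account_sample(pc, stack, depth);      // charge the matching vertex
  }

  void start_sampling() {
    int es = PAPI_NULL;
    PAPI_library_init(PAPI_VER_CURRENT);
    PAPI_create_eventset(&es);
    PAPI_add_event(es, PAPI_TOT_CYC);          // count total cycles
    PAPI_overflow(es, PAPI_TOT_CYC, SAMPLE_PERIOD, 0, on_overflow);
    PAPI_start(es);
  }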
2) Graph-Guided Communication Dependence:
During the static analysis, we derive the data and control dependence within each process. At runtime, we further collect the communication dependence between different processes for inter-process dependence analysis. Traditional tracing-based approaches record each communication operation and analyze their dependence, which causes large collection overhead and huge storage cost [26], [27]. We propose two key techniques to address this problem: sampling-based instrumentation and graph-guided communication compression.
Sampling-Based Instrumentation
Full instrumentation always introduces large overhead, yet dynamic program behavior may be missed if the instrumentation is recorded only once. To reduce the runtime overhead while still capturing the dynamic program behavior throughout the execution, we adopt a random sampling-based instrumentation technique [28]. A random number is generated every time the instrumentation point is executed; when the random number falls into the interval defined by a pre-set threshold, we record the communication parameters. The random sampling used here avoids missing regular communication patterns as much as possible, even if they change at runtime.
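A minimal sketch of such a sampling test, wrapped around an MPI call via the PMPI interface, is shown below. The recording probability and the record_comm_params helper are our own assumptions, not ScalAna's actual interface.

  // Minimal sketch: record parameters for only a random fraction of the
  // executions of an instrumentation point.
  #include <mpi.h>
  #include <cstdlib>

  static const double SAMPLE_PROB = 0.01;      // assumed recording threshold

  extern void record_comm_params(int peer, int tag, int count); // hypothetical

  int MPI_Send(const void *buf, int count, MPI_Datatype type,
               int dest, int tag, MPI_Comm comm) {
    // Draw a fresh random number at every execution; record only when it
    // falls below the pre-defined threshold.
    if ((double)std::rand() / RAND_MAX < SAMPLE_PROB)
      record_comm_params(dest, tag, count);
    return PMPI_Send(buf, count, type, dest, tag, comm);
  }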
Graph-Guided Communication Compression
A typicalparallel program contains a large number of communicationoperations. Due to the redundancy between different loopiterations, we do not need to record all the communicationoperations. As the PSG already represents the program’s high-level communication structure, we can leverage this graph toreduce communication records. We only record communica-tion operation parameters once for repeated communicationswith the same parameters of the recorded data, which canreduce the storage cost and ease the analysis of inter-processdependence.We use PMPI [29] in this work for effective communication map < MPI_Request *, pair < int , int >> requestConverter ; int MPI_Irecv (..., int source , int tag , ..., MPI_Request * request ) { requestConverter [ request ] = < source , tag >; return PMPI_Irecv (...); } int MPI_Wait ( MPI_Request * request , MPI_Status * status ) { retval = PMPI_Wait ( request , status ); < source , tag > = requestConverter [ request ]; if ( source or tag is uncertain) { commSet . insert (< status.MPI_SOURCE , status.MPI_TAG >); } else { commSet . insert (< source , tag >); } return retval ; } Fig. 5: Acquiring communication dependence for non-blocking communicationscollection, which does not need to modify the source code. Fordifferent communication types, we adopt different methods tocollect their dependence. We distinguish three common classesof communication: (1) For collective communication, weshould know which processes are involved in this communica-tion. In MPI programs, we can use
MPI_Comm_get_info to acquire this information. (2) For blocking point to pointcommunication, we should record the source or dest processand tag directly. (3) For non-blocking communication, someinformation will not be available until final checking functionsare invoked (such as MPI_Wait ).We take
MPI_Wait after
MPI_Irecv as an example asshown in Figure 5. Firstly, we store the source process and tag from the parameters associated to the request in MPI_Irecv .Then in
MPI_Wait , the source and tag corresponding to the request are recorded into a communication dependence set.If the source or tag is uncertain, we acquire them from theparameter of status in MPI_Wait .
3) Indirect Function Calls:
Sometimes the program call graph cannot be fully obtained by static analysis due to indirect calls, such as function pointers. We collect the calling information of indirect calls at runtime and fill it into the graph. We perform the necessary instrumentation before the entry and exit of indirect calls, link this information to the real function calls via unique function IDs, and then refine the PSG obtained from the inter-procedural analysis.
C. Program Performance Graph
After both the static program analysis and the sampling-based profiling, we build the final Program Performance Graph (PPG). As each process shares the same source code, we duplicate the PSG for all processes. Then we add inter-process edges based on the communication dependence collected by the runtime analysis. For point-to-point communications, we match the sending and receiving processes. For collective communications, we associate all involved processes. Figure 6 shows a simplified final PPG for an example program running with 8 processes.

Fig. 6: A PPG running with 8 processes: (a) the code, a sequence of loops with MPI_Sendrecv calls; (b) the per-process PSG with Comp, MPI, and Loop vertices; (c) the PPG on 8 processes, where each vertex carries performance data (e.g., Time: 3123, TOT_INS: 76474453, TOT_LST: 2438723) and vertices are connected by execution-order edges.
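To illustrate the point-to-point matching step, the following sketch pairs recorded send and receive operations on (peer, tag) to create inter-process edges. The record and graph types are hypothetical simplifications; a full implementation would also match communicators and message order.

  // Minimal sketch: match send and receive records to create PPG edges.
  #include <vector>

  struct P2PRecord { int vertex; int peer; int tag; };  // one recorded op

  struct PPG {
    void addCommEdge(int srcProc, int srcV, int dstProc, int dstV);
  };

  void addP2PEdges(PPG &g,
                   const std::vector<std::vector<P2PRecord>> &sends,
                   const std::vector<std::vector<P2PRecord>> &recvs) {
    for (int p = 0; p < (int)sends.size(); ++p)
      for (const P2PRecord &s : sends[p])
        for (const P2PRecord &r : recvs[s.peer])  // receive records of the peer
          if (r.peer == p && r.tag == s.tag)      // matching send/recv pair
            g.addCommEdge(p, s.vertex, s.peer, r.vertex);
  }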
Note that the final PPG not only includes the data and control dependence of each process but also records the inter-process communication dependence. In addition, we attach key performance data to each vertex, which will be used for the subsequent scaling-issue detection. The performance of a given vertex in this graph can be affected either by its own computation patterns or by the performance of other vertices connected through data and control dependence within one process, as well as through communication dependence between different processes. We describe below how we locate the performance issue.
IV. SCALING LOSS DETECTION

In this section, we describe how we leverage the acquired PPG for effective and automatic scaling loss detection. Our approach consists of two key steps: location-aware problematic vertex detection and backtracking root cause identification. The former detects problematic vertices with poor scalability or abnormal behavior; the latter pinpoints the root cause of scaling loss problems.
A. Location-Aware Problematic Vertex Detection
One main advantage of our approach is that we have generated a final PPG for a given program. Although the inter-process communication dependence may change with different numbers of processes, the per-process PSG does not change with the problem size or job scale. Based on this observation, we propose a location-aware detection approach to identify problematic vertices. The core idea is to compare the performance data of PPG vertices that correspond to the same PSG vertex, both across different job scales (non-scalable vertex detection) and across different processes at a given job scale (abnormal vertex detection).
Non-Scalable Vertex Detection
The core idea is to find vertices in the PPG whose performance (execution time or hardware performance data) shows an unusual slope compared with other vertices as the number of processes increases. For instance, Figure 7(a) shows how the execution time of different vertices in a PSG changes as the process count increases. The execution time of the vertex shown by the red line does not decrease like that of the other vertices. When the execution time of such vertices accounts for a large proportion of the total time, they become a scaling issue.

A challenge for non-scalable vertex detection is how to merge performance data from a large number of processes. The simplest strategy is to use the performance data of a particular process for comparison, but this may lose information about the other processes. Another strategy is to use the mean or median of the performance data from all processes, together with the performance variance among processes to reflect the load distribution. We can also partition the processes into groups with clustering algorithms and then aggregate per group. In our implementation, we test all the strategies mentioned above and fit the merged data across different process counts with a log-log model [30]. With these fitting results, we sort all vertices by their rate of change as the scale increases and select the top-ranked vertices as potential non-scalable vertices.
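For instance, one way to realize the log-log fit is an ordinary least-squares regression of log(time) on log(process count), as sketched below; the data layout and the interpretation threshold are our assumptions.

  // Minimal sketch: slope of the least-squares fit of log(T) vs. log(P).
  // For ideal strong scaling the slope is about -1; a vertex whose slope
  // is near 0 (time not decreasing with P) is a non-scalable candidate.
  #include <cmath>
  #include <vector>

  double logLogSlope(const std::vector<double> &procs,
                     const std::vector<double> &times) {
    const int n = procs.size();
    double sx = 0, sy = 0, sxx = 0, sxy = 0;
    for (int i = 0; i < n; ++i) {
      double x = std::log(procs[i]), y = std::log(times[i]);
      sx += x; sy += y; sxx += x * x; sxy += x * y;
    }
    return (n * sxy - sx * sy) / (n * sxx - sx * sx);   // fitted slope
  }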
Fig. 7: Two kinds of problematic vertices: (a) a non-scalable vertex example; (b) an abnormal vertex example.
Abnormal Vertex Detection
For a given job scale, we can also compare the performance data of the same vertex across different processes. In typical SPMD (Single Program Multiple Data) programs, the same vertex tends to execute the same workload on every process, so a vertex with significantly different execution time can be marked as a potential abnormal vertex. Many causes can produce abnormal vertices, even setting aside the effect of performance variance [31]. For instance, a load balance problem can cause abnormal vertices on some processes. We can also identify communication vertices that have much larger synchronization overhead than their counterparts on other processes. Figure 7(b) shows the execution time, on 16 processes, of the PPG vertices that correspond to the same PSG vertex. Among them, processes 4 and 6 take longer to execute than the others and yield abnormal vertices. ScalAna allows a user-defined threshold, AbnormThd, to distinguish abnormal from normal vertices among the parallel processes. We discuss the details in Section VI-D.

As shown in Figure 8, after the above two steps, some problematic vertices are marked in the PPG generated in Figure 6 (vertices with blue and red color).
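A minimal sketch of this per-vertex comparison is shown below, using the median across processes as the reference. Treating AbnormThd as a multiplicative factor over the median is our assumption about how the threshold is applied.

  // Minimal sketch: flag processes whose time for one PSG vertex exceeds
  // AbnormThd times the median across all processes (e.g., 1.3).
  #include <algorithm>
  #include <vector>

  std::vector<int> abnormalProcs(const std::vector<double> &times,
                                 double abnormThd) {
    std::vector<double> sorted(times);
    std::sort(sorted.begin(), sorted.end());
    double median = sorted[sorted.size() / 2];
    std::vector<int> flagged;
    for (int p = 0; p < (int)times.size(); ++p)
      if (times[p] > abnormThd * median)
        flagged.push_back(p);                 // potential abnormal vertex
    return flagged;
  }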
B. Backtracking Root Cause Detection
Next, we need to connect the identified problematic vertices and find the causal relationships between them to locate the root cause of the scaling problem. In this work, we propose a novel graph analysis approach, named backtracking root cause detection, to automatically report the line numbers of the source code corresponding to the root cause. To enable the backward traversal, we first reverse all edges into dependence edges. The pseudo-code of the backtracking root cause algorithm is shown in Algorithm 1. Our algorithm starts from the non-scalable vertices detected in the previous step and tracks backwards through data/control dependence edges within a process and communication dependence edges between processes, until a root vertex or a collective communication vertex is reached. If an unscanned Loop or Branch vertex is encountered during the backtracking, the algorithm traverses only its control dependence edges, not its data dependence edges; for example, when a Loop vertex is encountered, the traversal continues from the end vertex of this loop.

One observation is that a complex parallel program usually contains a large number of dependence edges, so the search cost would be very high without optimization. However, we do not need to traverse all possible paths to identify the root cause. In ScalAna, we only preserve a communication dependence edge if a waiting event exists on it; the other communication dependence edges are pruned. The advantage of this approach is that it reduces both the search space and the false positives. Finally, we obtain several causal paths that connect sets of abnormal vertices. Further analysis of these identified paths helps application developers locate the root cause.
Algorithm 1: Backtracking Root Cause Algorithm

Input: A Program Performance Graph PPG, a set of non-scalable vertices N, a set of abnormal vertices A.
Output: A set of root cause paths S.

Function Main():
    S ← ∅; V ← ∅                 // V: set of scanned vertices
    forall n ∈ N do
        P ← ∅                    // root cause path
        Backtracking(n, P)
        insert P into S
        insert all v ∈ P into V
    forall a ∈ A with a ∉ V do   // traverse the vertices not yet scanned
        P ← ∅
        Backtracking(a, P)
        insert P into S
    return S

Function Backtracking(v, P):
    while v is not a root or collective communication vertex do
        insert v into P
        if v is an MPI vertex then
            v ← the dest vertex of the inter-process communication dependence edge of v
        else if v is an unscanned LOOP or BRANCH vertex then
            v ← the dest vertex of the control dependence edge of v
        else
            v ← the dest vertex of the data dependence edge of v

For example, in Figure 8, we start from the abnormal vertex a in the lower-left corner and track through a communication dependence edge to vertex b in another process. Then we backtrack through a data dependence edge to vertex c in process 2. We repeat these steps and finally identify a path (the red lines) connecting all the abnormal vertices across several processes. With a similar approach, we backtrack from the other two abnormal vertices, and two extra paths are identified in Figure 8, shown in blue and green respectively. With these identified paths, we can connect the different abnormal vertices, including MPI invocations and computation components, and identify the root cause of the scaling loss.
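For concreteness, the traversal of Algorithm 1 can be rendered in C++ as below; the vertex kinds and single-edge accessors are hypothetical simplifications of the PPG interface, not ScalAna's actual data structures.

  // Minimal sketch of Backtracking(v, P): follow one dependence edge per
  // step until a root or collective communication vertex is reached.
  #include <vector>

  enum class VKind { Root, Collective, MPI, Loop, Branch, Comp };

  struct PPGVertex {
    VKind kind;
    bool scanned = false;
    PPGVertex *commDep = nullptr;   // inter-process communication dependence
    PPGVertex *ctrlDep = nullptr;   // control dependence within the process
    PPGVertex *dataDep = nullptr;   // data dependence within the process
  };

  void backtracking(PPGVertex *v, std::vector<PPGVertex*> &path) {
    while (v && v->kind != VKind::Root && v->kind != VKind::Collective) {
      path.push_back(v);
      bool firstVisit = !v->scanned;
      v->scanned = true;
      if (v->kind == VKind::MPI)
        v = v->commDep;             // cross to the peer process
      else if (firstVisit &&
               (v->kind == VKind::Loop || v->kind == VKind::Branch))
        v = v->ctrlDep;             // unscanned Loop/Branch: control edge
      else
        v = v->dataDep;             // otherwise follow data dependence
    }
  }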
Fig. 8: Problematic vertices (abnormal and non-scalable vertices with their dependence edges) and backtracking root cause detection on the PPG

Note that some vertices may be both non-scalable and abnormal. The interplay of non-scalable and abnormal vertices can make the program performance even harder to understand. Sometimes, optimizing some vertices on the identified paths can also improve the overall performance of a non-scalable vertex.

V. IMPLEMENTATION AND USAGE
For the static analysis module of ScalAna, we use LLVM-3.3.0 and Dragonegg-3.3.0 [23] for PSG generation and program instrumentation. For the sampling-based profiling, we use PAPI-5.2.0 [25], [32] to collect hardware performance data, and the PMPI interface to collect communication dependencies. With both the static PSG and the dynamic profiling data, ScalAna generates PPGs and performs scaling loss detection post-mortem.

In general, there are four main steps for end-users to use ScalAna: (1) Compile the application with ScalAna-static to generate the PSG. (2) Run the instrumented application with ScalAna-prof for different process counts to collect profiling data. (3) Use ScalAna-detect to automatically detect the root cause of scaling loss. (4) We also provide a GUI in ScalAna, ScalAna-viewer, to show the code snippets corresponding to the diagnosed root causes. Besides, users can adjust the user-defined parameters such as MaxLoopDepth and AbnormThd to trade off detection precision against system overhead.

Figure 9 shows a screenshot of ScalAna's GUI. The upper window lists the root cause vertices and their calling paths. The lower window shows the code snippets corresponding to the vertices. The root causes can further be sorted by the length of execution time and by the imbalance among the parallel processes.

Fig. 9: GUI of ScalAna

ScalAna currently only supports MPI-based programs in C or Fortran. However, all phases of ScalAna (program structure extraction, profiling data collection, and root-cause detection) are general enough to be adapted to other message-passing programs. In addition, our approach can also be extended to other programming models such as OpenMP or Pthreads with additional profiling techniques. We leave this for future work.

VI. EVALUATION
A. Experimental Setup
Experimental Platforms
We perform the experiments on two testbeds: (1) Gorgon, a cluster with dual Intel Xeon E5-2670 (v3) processors per node and a 100 Gbps 4×EDR InfiniBand interconnect. (2) The Tianhe-2 supercomputer. Each node of Tianhe-2 has two Intel Xeon E5-2692 (v2) processors (24 cores in total) and 64 GB of memory. The Tianhe-2 supercomputer uses a customized high-speed interconnection network.
Evaluated Programs
We use a variety of parallel programs to evaluate the efficacy of ScalAna, including BT, CG, SP, EP, FT, MG, LU, and IS from the widely used NPB benchmark suite [6], plus three real-world applications: Zeus-MP [33], SST [34], and Nekbone [35]. For the NPB programs, problem size CLASS C is used on Gorgon and CLASS D is used on the Tianhe-2 supercomputer.

Methodology
In our evaluation, we first analyze the common features of the generated program structure graphs for each program, and then we present the performance overhead of our tool, including runtime overhead and storage cost (these experiments are for overhead comparison, not typical use cases for scaling issue detection). Finally, we use three real applications to demonstrate the benefit of our approach. For all experiments, we run three times and average the results for each process scale to reduce performance variance.

We compare our approach with two state-of-the-art performance tools, HPCToolkit [8] and Scalasca [7]. To ensure a fair comparison, we give the detailed configurations of these two tools: (1) For the tracing-based tool, Scalasca (v2.5), we first use its profiling function to identify where detailed tracing is needed, then we run small-scale jobs with limited instrumentation, and we increase the process count and the instrumentation complexity iteratively until the scalability bottlenecks are identified. In this way, Scalasca introduces as little storage cost as possible. (2) For the profiling-based tool, HPCToolkit (v2019.08), the sampling frequency is the key parameter that affects the runtime overhead; ScalAna keeps the same sampling frequency (200 Hz) as HPCToolkit in all experiments. For all experiments, MaxLoopDepth is set to 10 and AbnormThd is set to 1.3 empirically.

TABLE II: Code size and vertex information of the PSG for the evaluated programs. #V gives the vertex counts without and with graph contraction; Loop, Branch, Comp, and MPI are the numbers of Loop, Branch, Comp, and MPI vertices, respectively.

  Program   Code (KLoC)  #V (w/o contraction)  #V (contracted)  Loop    Branch  Comp    MPI
  BT        9.3          974                   377              39      57      176     103
  CG        2.0          431                   190              18      10      95      66
  EP        0.6          91                    32               4       2       13      12
  FT        2.5          4,285                 241              15      22      118     35
  MG        2.8          7,842                 1,973            177     233     942     463
  SP        5.1          734                   278              13      34      138     89
  LU        7.7          2,370                 663              18      66      327     237
  IS        1.3          240                   55               1       3       28      19
  SST       40.8         23,608                5,217            321     641     1,434   1,303
  NEKBONE   31.8         1,289                 944              239     162     423     83
  ZEUS-MP   44.1         273,715               64,570           1,677   1,304   30,099  11,818
B. PSG Analysis
Table II summarizes the code size and the vertex counts of all generated PSGs. The results include the number of lines of source code, the number of vertices before and after graph contraction, and the numbers of Loop, Branch, Comp, and MPI vertices. In our experiments, the total vertex count correlates with the number of lines of source code in most cases. Graph contraction reduces the number of vertices by 68% on average. Furthermore, Comp and MPI vertices make up more than 73% of all vertices, which indicates that the PSG can fully represent the computation and communication characteristics.
C. Performance Overhead
We evaluate ScalAna on the Tianhe-2 supercomputer with up to 2,048 processes; the comparison experiments with Scalasca and HPCToolkit are run on Gorgon with up to 128 processes, due to installation limitations of the Tianhe-2 supercomputer's external network.

TABLE III: The static overhead of ScalAna on Gorgon

  Programs  BT    CG    EP    FT    MG    SP    LU    IS    SST   NEK   ZMP
  Ovd (%)   0.32  0.77  0.38  0.35  0.29  0.31  0.28  0.68  3.01  0.43  2.96
Static Overhead
We first evaluate the compilation overhead introduced by the static analysis on Gorgon. As shown in Table III, ScalAna only incurs a very low compilation overhead compared with the original LLVM compilation cost (0.28% to 3.01%, 0.89% on average). Besides, the memory cost of the static analysis is proportional to the size of the PSG; for example, each vertex of the PSG occupies 32 B of memory on Gorgon, and the static analysis incurs about 9 MB of additional memory for Zeus-MP.

Runtime Overhead

The runtime overhead of ScalAna is shown as the gray bars in Figure 10, averaged over 4 to 128 MPI processes (4 to 121 processes for BT and SP, due to their requirements on process counts). As shown in Figure 10, ScalAna only introduces a very small overhead, ranging from 0.72% to 9.73% and averaging 3.52% on Gorgon, which is much lower than Scalasca [7]. For Scalasca, the trace buffer size (SCOREP_TOTAL_MEMORY) is configured large enough to avoid intermediate trace flushing before the program ends. Besides, for ScalAna, the average runtime overhead of the NPB benchmarks with 2,048 processes on the Tianhe-2 supercomputer is 1.73%.

Fig. 10: Average runtime overhead of Scalasca [7], HPCToolkit [8], and ScalAna with 4 to 128 processes (without I/O)

Storage Cost

Figure 11 shows the storage costs of ScalAna, HPCToolkit, and Scalasca running with 128 processes (121 for BT and SP) on Gorgon. ScalAna only incurs storage costs on the order of kilobytes, while Scalasca and HPCToolkit generate megabytes to gigabytes of data. Besides, for ScalAna, the average storage cost of the NPB benchmarks with 2,048 processes on the Tianhe-2 supercomputer is 4.72 MB.

Fig. 11: Storage cost of Scalasca [7], HPCToolkit [8], and ScalAna running with 128 processes

TABLE IV: The post-mortem detection cost of ScalAna with 128 processes

  Programs    BT    CG    EP    FT    MG    SP    LU    IS    SST   NEK   ZMP
  Cost (sec)  3.26  1.74  0.29  2.20  1.80  2.40  6.06  0.50  9.54  8.63  11.81
Post-mortem Detection Cost
We evaluate the cost of the backtracking root cause detection in ScalAna on Gorgon. As shown in Table IV, the scaling loss detection only introduces a small cost compared with the execution time of the program (up to 11.81 seconds, 8.44% of the execution time, on 128 processes). The memory consumption of the post-mortem detection is proportional to the program structure and the size of the profiling data (about 50 MB for Zeus-MP on 128 processes).

Fig. 12: Backtracking algorithm on the PPG for a Zeus-MP run with 128 processes. The backtracking paths lead from the scaling issue, the MPI_Allreduce at nudt.F:361, through the MPI_Waitall calls at nudt.F:328, 269, and 227, to the root cause loop at bval3d.F:155:

  do j=js,je+1
    if (abs(niib23(j,k)) .eq. 1) then
      v2b3(is-1,j,k) = v2b3(is  ,j,k)
      v2b3(is-2,j,k) = v2b3(is+1,j,k)
D. Case Studies with Real Applications
In this section, we use three real applications, Zeus-MP [33], SST [34], and Nekbone [35], to demonstrate how to diagnose scaling issues with our performance tool. Once the root causes of the scaling issues are identified, we optimize the code to improve the scalability of these applications. We also analyze the advantages of our approach over the two state-of-the-art tools, HPCToolkit [8] and Scalasca [7].
1) Zeus-MP:
Zeus-MP [33] is a computational fluid dynamics program that simulates astrophysical phenomena in three spatial dimensions using the MPI programming model. Non-blocking point-to-point (P2P) communications are used to implement complex inter-process synchronization. We evaluate its performance with a problem size of 64 × 64 × 64 for process counts ranging from 4 to 128. We observe a significant scaling loss for 128 processes: the speedup is only 55.53× on 128 processes, compared with 35.40× on 64 processes (1 process as baseline). ScalAna is then applied to diagnose the problem.

Scaling Loss Detection

ScalAna first generates a PPG and then performs the backtracking algorithm on this graph to identify the root causes automatically. Figure 12 shows how ScalAna diagnoses the scaling issues on the PPG of Zeus-MP with its backtracking algorithm. The vertical axis, from top to bottom, represents the control/data flow, and the horizontal axis represents the different parallel processes. The small points represent PPG vertices with normal performance, while the circled points represent problematic vertices with abnormal performance for the same code snippets. The arrows show the backtracking paths based on intra- and inter-process dependence.

In detail, the MPI_Allreduce at nudt.F:361 is detected as a scaling issue due to the poor scalability of its execution time. As shown in Figure 12, the dark red (darkest) lines track backwards from the abnormal MPI_Allreduce vertices, then go through the intra-process dependence of control/data flow and the inter-process dependence of P2P communications at nudt.F:328, 269, and 227. The red (lighter) and orange (lightest) lines indicate similar backtracking paths. Finally, the LOOP vertices at bval3d.F:155 (top row in Figure 12) are identified as the root causes of the scaling issues. We find that the underlying reason is that only some busy processes execute the LOOP at bval3d.F:155 while the others are idle in non-blocking P2P communications at nudt.F:227. Delays in these processes propagate through the non-blocking P2P communications at nudt.F:269 and nudt.F:328. The MPI_Allreduce at nudt.F:361 synchronizes all processes and leads to the low performance of Zeus-MP.

Optimization
To fix the performance issue identified by ScalAna, we change the program into a hybrid programming model with MPI plus OpenMP by adding multi-thread support to the LOOP at bval3d.F:155, which accelerates the busy processes and mitigates the latent load imbalance between busy and idle processes. Similarly, ScalAna also detects other root causes of the scaling loss in the LOOPs at hsmoc.F:665, 841, and 1,041. ScalAna shows that the load/store instruction count and the cache miss count recorded by the PMU (Performance Monitoring Unit) stay high with increasing numbers of processes. We use loop tiling and scalar promotion to reduce the cache misses and memory accesses. With these optimizations, the speedup of Zeus-MP increases from 55.53× to 61.39× (1 process as baseline) on 128 processes, a 9.55% performance improvement on Gorgon.

We also test the optimized performance of Zeus-MP with a larger process count. The speedup of Zeus-MP increases from 68.41× to 76.15× (16 processes as baseline) on 2,048 processes, a 9.96% performance improvement on the Tianhe-2 supercomputer. Note that more optimization techniques could be explored for Zeus-MP; we only apply some common optimizations here to verify the performance bottlenecks detected by ScalAna.

Comparison
As for the other state-of-the-art tools, Scalasca can accurately detect the root causes at function level when the number of processes increases to 64, with some human intervention. The profiling-based HPCToolkit can automatically detect the fine-grained loop-level scaling issues. Specifically, the MPI_Allreduce at nudt.F:361 and the LOOP at bval3d.F:155 are detected as scalability bottlenecks by HPCToolkit. However, HPCToolkit cannot easily identify the root cause problem (the LOOP at bval3d.F:155) without significant human effort: its output shows multiple bottlenecks without analyzing their underlying relationships to infer which one is the actual root cause.

Figure 13 shows the performance and storage analysis of ScalAna against the state-of-the-art Scalasca and HPCToolkit; lower is better in both Figure 13(a) and 13(b). As for performance, both ScalAna and HPCToolkit have a negligible runtime overhead of 1.85% and 2.01% on average, respectively. In contrast, the tracing-based Scalasca introduces 40.89% runtime overhead on 64 processes (without I/O) to generate traces. For storage, our light-weight ScalAna is much better than Scalasca: ScalAna only needs 20 MB of storage space, while Scalasca generates 28.26 GB of traces for 64 processes.

Fig. 13: (a) Runtime overhead and (b) storage cost of Scalasca [7], HPCToolkit [8], and ScalAna when running Zeus-MP (Scalasca detects the root cause when the number of processes increases to 64.)
2) SST:
SST (Structural Simulation Toolkit) [34] is a multi-process simulation framework that simulates microarchitecture and memory in highly concurrent systems. We execute SST with process counts ranging from 4 to 128; the results show that the speedup is only 1.20× on 32 processes, compared with 1.28× on 16 processes (4 processes as baseline). We notice that the dependence between simulated events in SST is usually complex, so most events must be executed sequentially; in most cases, parallelism only occurs within each event, causing a relatively low speedup for 32 processes. We use ScalAna to analyze the scaling loss of SST.

Scaling Loss Detection

ScalAna finds that the scaling loss mainly comes from the MPI_Allreduce in the RankSyncSerialSkip::exchange function at rankSyncSerialSkip.cc:235. As shown in Figure 14, after tracking backwards through the P2P communications of MPI_Waitall in the RankSyncSerialSkip::exchange function at rankSyncSerialSkip.cc:217, the LOOP in the RequestGenCPU::handleEvent function at mirandaCPU.cc:247 is identified as the root cause of the scaling issues. The colored lines show some backtracking paths as examples.

Optimization
As shown in Figure 15, ScalAna provides PMU data showing that the total instruction counts (TOT_INS) of the different processes differ greatly in this loop. Based on the results of ScalAna, we find that the program uses an inefficient data structure (an array) to process each query on a critical path of every process, which causes different execution times (TOT_INS) for traversing the array on different processes. We modify the code and change the data structure from an array to an unordered map, which reduces the complexity of the query algorithm from O(n) to O(log n) and makes the load (the execution time of a query) of the different processes more balanced.

  // mirandaCPU.cc:247 — source code of the root cause
  for (uint32_t i = 0; i < pendingRequests.size(); ++i) {
    pendingRequests.at(i)->satisfyDependency(cpuReq->getOriginalReqID());
  }

  // optimization
  for (uint32_t i = 0; i < pendingRequests.size(); ++i) {
    auto id = cpuReq->getOriginalReqID();
    for (auto req : callbacks[id]) {
      req->satisfyDependency(id);
    }
  }

Fig. 14: Backtracking algorithm on the PPG and code optimization for an SST run with 32 processes (backtracking from the MPI_Allreduce at rankSyncSerialSkip.cc:235 through the MPI_Waitall at rankSyncSerialSkip.cc:217 to the loop at mirandaCPU.cc:247)

Fig. 15: PMU data for SST running with 32 processes

Figure 15 also shows the TOT_INS counts of the different processes after our optimization, which are more balanced than in the original SST. After the optimization, the speedup of SST on 32 processes increases from 1.20× to 1.56× (4 processes as baseline), and the performance is improved by 73.12%.

Comparison
The state-of-the-art profiling tool HPCToolkit only locates MPI_Waitall as a scalability bottleneck, but not the LOOP in the RequestGenCPU::handleEvent function, because it does not profile threads created at runtime (although its method could profile these threads in theory). Even with profiling on threads, the root cause identification would still need more human analysis. Besides, ScalAna provides the PMU data of the root causes, which makes architecture-level analysis possible for developers. For storage, ScalAna only needs 1.03 MB of storage space, while Scalasca needs 31.56 GB to store the generated traces of 32 processes.
3) Nekbone:
Nekbone, the basic structure of Nek5000 [35], uses a spectral element method to solve the Helmholtz equation in three-dimensional space. We execute Nekbone at a scale of 16,384 elements for process counts ranging from 4 to 128. Nekbone encounters a scaling issue when running on 64 processes: the speedup is only 31.95× on 64 processes, while the speedup on 32 processes is 20.61× (1 process as baseline).

Scaling Loss Detection

We use ScalAna to analyze the root cause of the scalability problem. The MPI_Waitall in the comm_wait function at comm.h:243 is detected as a non-scalable vertex. Using the backtracking algorithm on the PPG through inter-process dependence, ScalAna finds that the root cause of the scaling loss is the LOOP in the dgemm function at blas.f:8,941. In this loop, some processes consume significantly less time than others, which increases the waiting time of MPI_Waitall and finally leads to the poor scalability of Nekbone.
Fig. 16: PMU data for Nekbone running with 32 processes
Optimization
As shown in Figure 16, the PMU data provided by ScalAna shows that the load/store instruction count (TOT_LST_INS) of this loop is the same among processes, while the total cycle count (TOT_CYC) of the loop differs. We find that the memory access speed of each processor core differs, and the processes are bound to different processor cores. From the code perspective, we optimize the loop by using a more efficient linear algebra library (BLAS) to reduce the number of TOT_LST_INS and mitigate the time variance among processes. Figure 16 also shows that TOT_LST_INS decreases by 89.78%, and the execution time variance among the processes is reduced by 94.03%. After the optimization, the speedup on 64 processes increases from 31.95× to 51.96× (1 process as baseline), and the performance is improved by 68.95%.

We also analyze the optimized performance of Nekbone with a larger process count. The speedup on 2,048 processes increases from 27.08× to 29.97× (64 processes as baseline), and an 11.11% performance improvement is achieved on the Tianhe-2 supercomputer.

Comparison
For HPCToolkit, the MPI_Waitall at comm.h:243, the LOOP at blas.f:8,941, and some other points are detected as potential bottlenecks, but further manual analysis is needed to find the root cause. For storage, ScalAna only needs 0.32 MB of storage space, while Scalasca needs 3.44 GB to store the generated traces of 64 processes.

VII. RELATED WORK
Mohr [36] gives a comprehensive survey of state-of-the-art performance analysis tools, including both tracing- and profiling-based methods. Knobloch et al. [37] present a thorough survey of performance tools for heterogeneous HPC applications. In the remainder of this section, we discuss representative related work on performance analysis in detail.
Tracing
Traces are widely used for analyzing program behavior. Intel provides a trace collection tool to understand an MPI program's behavior [12]. Based on the Score-P infrastructure [38], [39], TAU [40], [41], Vampir [24], [42], [43], [44], Scalasca [7], [45], and other state-of-the-art tools support various programming models, such as MPI, OpenMP, Pthreads, and CUDA. These tools can visualize trace data and provide fine-grained performance analysis for developers. Paraver [46], [47], [48] is a tracing-based performance analyzer that supports flexible data collection and detailed analysis of metric variability. Becker et al. [49] use event traces to analyze the performance of large-scale programs. Although many approaches for trace compression have been proposed [50], [51], [52], [53], tracing still often brings very large overhead, which makes it unsuitable for production environments.
Profiling
Profiling can extract a program's statistical information with very low overhead. mpiP [9] is a light-weight profiling library for MPI applications, which collects statistical information about MPI functions with low overhead. Tallent et al. [10], [11] use call path profiling to identify and quantify load imbalance in parallel programs. STAT [54] performs large-scale debugging by sampling stack traces to assemble a profile of an application's behavior. HPCToolkit [8] uses sampling-based techniques to obtain performance profiles of applications and visualizes the results with hpcviewer and hpctraceviewer. Arm MAP [55] is a light-weight profiler, available as part of the Arm Forge debug and profile suite. Cray develops CrayPat [56] for XC platforms, supporting both tracing- and profiling-based performance analysis. However, profiling often misses important information, which may prevent us from correctly understanding a program's behavior.

Our approach uses profiling to collect dynamic statistical information while combining it with the statically extracted program structure, so that we achieve high accuracy with low overhead.
Program structure based program analysis
Cypress [50] and Spindle [57] use hybrid static-dynamic analysis for communication trace compression and memory access monitoring. By extracting the program structure at compile time, the runtime overhead can be significantly reduced. Weber et al. [58] present an effective structural similarity measure to classify and store the data of parallel programs. Program structure is also used for large-scale debugging [59], [60], [61], [62], since it contains both inter- and intra-process dependence, which plays an important role in debugging at scale.
Detecting scalability bottlenecks
Coarfa et al. [63] identify scalability bottlenecks by analyzing HPCToolkit's [8] hpcviewer data with a top-down approach; however, this cannot deal with communication patterns that have complex dependence. Bohme et al. [64] use runtime traces to identify the root cause of wait states. As a tracing-based approach, Bohme's work performs forward and backward trace replay on collected timeline traces; with complete traces, delays and root causes can be accurately identified. Inspired by Bohme's backward-replay analysis, we propose the backtracking root cause detection algorithm in ScalAna. Instead of recording a large amount of traces, our approach works on the program-structure-based PPG, which contains little profiling data; therefore, ScalAna introduces very low storage cost and detection overhead. Barnes et al. [30] use regression-based approaches for scalability prediction. Calotoiu et al. [18] automate traditional performance modeling to detect scalability bugs, and Bhattacharyya et al. [17] improve it using compiler techniques. Chen et al. [65] present a scalable performance modeling framework based on the concept of critical-path candidates for MPI workloads. ScaAnalyzer [3] collects, attributes, and analyzes memory-related metrics at runtime to identify scalability bottlenecks caused by the memory access behavior of parallel programs running on a single node. COLAB [66] collects and accumulates futexes from the Linux kernel at runtime to detect bottlenecks caused by program synchronization.

Our work targets detecting scalability bottlenecks using the program structure combined with runtime profiling information, which helps locate the root cause more accurately.

VIII. CONCLUSION
VIII. CONCLUSION
In this paper, we design ScalAna, a light-weight performance tool that efficiently detects scalability problems in parallel programs by combining static and dynamic analysis. ScalAna uses a novel approach, named backtracking root cause detection, to automatically identify the root causes of scaling loss in complex parallel programs by traversing a program performance graph. We evaluate it with both benchmarks and real applications. Results show that ScalAna can identify scalability bottlenecks with very low overhead and outperforms state-of-the-art approaches.
REFERENCES
[1] "Top500 website," 2020. [Online]. Available: http://top500.org/
[2] J. Y. Shi, M. Taifi, A. Pradeep, A. Khreishah, and V. Antony, "Program scalability analysis for HPC cloud: Applying Amdahl's law to NAS benchmarks," IEEE, 2012, pp. 1215-1225.
[3] X. Liu and B. Wu, "ScaAnalyzer: A tool to identify memory scalability bottlenecks in parallel programs," in Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis. ACM, 2015, p. 47.
[4] O. Pearce, H. Ahmed, R. W. Larsen, P. Pirkelbauer, and D. F. Richards, "Exploring dynamic load imbalance solutions with the CoMD proxy application," Future Generation Computer Systems, vol. 92, pp. 920-932, 2019.
[5] D. Schmidl, M. S. Müller, and C. Bischof, "OpenMP scalability limits on large SMPs and how to extend them," Fachgruppe Informatik, Tech. Rep., 2016.
[6] D. Bailey, T. Harris, W. Saphir, R. V. D. Wijngaart, A. Woo, and M. Yarrow, The NAS Parallel Benchmarks 2.0. Moffett Field, CA: NAS Systems Division, NASA Ames Research Center, 1995.
[7] M. Geimer, F. Wolf, B. J. Wylie, E. Ábrahám, D. Becker, and B. Mohr, "The Scalasca performance toolset architecture," Concurrency and Computation: Practice and Experience, vol. 22, no. 6, pp. 702-719, 2010.
[8] L. Adhianto, S. Banerjee, M. Fagan, M. Krentel, G. Marin, J. Mellor-Crummey, and N. R. Tallent, "HPCToolkit: Tools for performance analysis of optimized parallel programs," Concurrency and Computation: Practice and Experience, vol. 22, no. 6, pp. 685-701, 2010.
[9] J. Vetter and C. Chambreau, "mpiP: Lightweight, scalable MPI profiling," 2005.
[10] N. R. Tallent, L. Adhianto, and J. M. Mellor-Crummey, "Scalable identification of load imbalance in parallel executions using call path profiles," in Proceedings of the 2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis. IEEE Computer Society, 2010, pp. 1-11.
[11] N. R. Tallent, J. M. Mellor-Crummey, L. Adhianto, M. W. Fagan, and M. Krentel, "Diagnosing performance bottlenecks in emerging petascale applications," in Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis. IEEE, 2009, pp. 1-11.
[12] "Intel Trace Analyzer and Collector." [Online]. Available: https://software.intel.com/en-us/trace-analyzer
[13] J. Zhai, W. Chen, and W. Zheng, "Phantom: Predicting performance of parallel applications on large-scale parallel machines using a single node," in ACM SIGPLAN Notices, vol. 45, no. 5. ACM, 2010, pp. 305-314.
[14] J. C. Linford, S. Khuvis, S. Shende, A. Malony, N. Imam, and M. G. Venkata, "Performance analysis of OpenSHMEM applications with TAU Commander," in Workshop on OpenSHMEM and Related Technologies. Springer, 2017, pp. 161-179.
[15] H. Yin, Z. Hu, X. Zhou, H. Wang, K. Zheng, Q. V. H. Nguyen, and S. Sadiq, "Discovering interpretable geo-social communities for user behavior prediction," IEEE, 2016, pp. 942-953.
[16] H. Yin, B. Cui, X. Zhou, W. Wang, Z. Huang, and S. Sadiq, "Joint modeling of user check-in behaviors for real-time point-of-interest recommendation," ACM Transactions on Information Systems, vol. 35, no. 2, p. 11, 2016.
[17] A. Bhattacharyya, G. Kwasniewski, and T. Hoefler, "Using compiler techniques to improve automatic performance modeling," in Proceedings of the 24th International Conference on Parallel Architectures and Compilation. ACM, Oct. 2015.
[18] A. Calotoiu, T. Hoefler, M. Poke, and F. Wolf, "Using automated performance modeling to find scalability bugs in complex codes," in Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis. ACM, 2013, p. 45.
[19] F. Wolf, C. Bischof, A. Calotoiu, T. Hoefler, C. Iwainsky, G. Kwasniewski, B. Mohr, S. Shudler, A. Strube, A. Vogel et al., "Automatic performance modeling of HPC applications," in Software for Exascale Computing - SPPEXA 2013-2015. Springer, 2016, pp. 445-465.
[20] D. Beckingsale, O. Pearce, I. Laguna, and T. Gamblin, "Apollo: Reusable models for fast, dynamic tuning of input-dependent code," IEEE, 2017, pp. 307-316.
[21] J. C. Linford, J. Michalakes, M. Vachharajani, and A. Sandu, "Multi-core acceleration of chemical kinetics for simulation and prediction," in Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis, 2009, pp. 1-11.
[22] A. Calotoiu, D. Beckinsale, C. W. Earl, T. Hoefler, I. Karlin, M. Schulz, and F. Wolf, "Fast multi-parameter performance modeling," Sep. 2016, pp. 172-181.
[23] "The LLVM compiler framework." [Online]. Available: http://llvm.org
[24] W. E. Nagel, A. Arnold, M. Weber, H.-C. Hoppe, and K. Solchenbach, "VAMPIR: Visualization and analysis of MPI resources," 1996.
[25] "PAPI tools." [Online]. Available: http://icl.utk.edu/papi/software/
[26] X. Wu and F. Mueller, "ScalaExtrap: Trace-based communication extrapolation for SPMD programs," in ACM SIGPLAN Notices, vol. 46, no. 8. ACM, 2011, pp. 113-122.
[27] M. Noeth, P. Ratn, F. Mueller, M. Schulz, and B. R. De Supinski, "ScalaTrace: Scalable compression and replay of communication traces for high-performance computing," Journal of Parallel and Distributed Computing, vol. 69, no. 8, pp. 696-710, 2009.
[28] J. Vetter, "Dynamic statistical profiling of communication activity in distributed applications," ACM SIGMETRICS Performance Evaluation Review, vol. 30, no. 1, pp. 240-250, 2002.
[29] B. Mohr, PMPI Tools. Boston, MA: Springer US, 2011, pp. 1570-1575. [Online]. Available: https://doi.org/10.1007/978-0-387-09766-4_57
[30] B. J. Barnes, B. Rountree, D. K. Lowenthal, J. Reeves, B. De Supinski, and M. Schulz, "A regression-based approach to scalability prediction," in Proceedings of the 22nd Annual International Conference on Supercomputing. ACM, 2008, pp. 368-377.
[31] X. Tang, J. Zhai, X. Qian, B. He, W. Xue, and W. Chen, "vSensor: Leveraging fixed-workload snippets of programs for performance variance detection," in ACM SIGPLAN Notices, vol. 53, no. 1. ACM, 2018, pp. 124-136.
[32] D. Terpstra, H. Jagode, H. You, and J. Dongarra, "Collecting performance data with PAPI-C," in Tools for High Performance Computing 2009. Springer, 2010, pp. 157-173.
[33] J. C. Hayes, M. L. Norman, R. A. Fiedler, J. O. Bordner, P. S. Li, S. E. Clark, M.-M. Mac Low et al., "Simulating radiating and magnetized flows in multiple dimensions with ZEUS-MP," The Astrophysical Journal Supplement Series, vol. 165, no. 1, p. 188, 2006.
[34] A. F. Rodrigues, K. S. Hemmert, B. W. Barrett, C. Kersey, R. Oldfield, M. Weston, R. Risen, J. Cook, P. Rosenfeld, E. Cooper-Balis et al., "The structural simulation toolkit," ACM SIGMETRICS Performance Evaluation Review, vol. 38, no. 4, pp. 37-42, 2011.
[35] P. F. Fischer, J. W. Lottes, and S. G. Kerkemeier, "Nek5000 web page," 2008.
[36] B. Mohr, "Scalable parallel performance measurement and analysis tools - state-of-the-art and future challenges," Supercomputing Frontiers and Innovations, vol. 1, no. 2, pp. 108-123, 2014.
[37] M. Knobloch and B. Mohr, "Tools for GPU computing - debugging and performance analysis of heterogenous HPC applications," Supercomputing Frontiers and Innovations, vol. 7, no. 1, pp. 91-111, 2020.
[38] D. an Mey, S. Biersdorf, C. Bischof, K. Diethelm, D. Eschweiler, M. Gerndt, A. Knüpfer, D. Lorenz, A. Malony, W. E. Nagel et al., "Score-P: A unified performance measurement system for petascale applications," in Competence in High Performance Computing 2010.
[40] S. S. Shende and A. D. Malony, "The TAU parallel performance system," The International Journal of High Performance Computing Applications, vol. 20, no. 2, pp. 287-311, 2006.
[41] "TAU homepage. University of Oregon." [Online]. Available: http://tau.uoregon.edu
[42] M. S. Müller, A. Knüpfer, M. Jurenz, M. Lieber, H. Brunst, H. Mix, and W. E. Nagel, "Developing scalable applications with Vampir, VampirServer and VampirTrace," in PARCO, vol. 15. Citeseer, 2007, pp. 637-644.
[43] A. Knüpfer, H. Brunst, J. Doleschal, M. Jurenz, M. Lieber, H. Mickler, M. S. Müller, and W. E. Nagel, "The Vampir performance analysis tool-set," in Tools for High Performance Computing.
[47] H. Servat, G. Llort, J. Giménez, and J. Labarta, "Detailed performance analysis using coarse grain sampling," in European Conference on Parallel Processing. IEEE, 2007, pp. 1-10.
[50] J. Zhai, J. Hu, X. Tang, X. Ma, and W. Chen, "Cypress: Combining static and dynamic analysis for top-down communication trace compression," in SC'14: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis. IEEE, 2014, pp. 143-153.
[51] M. Noeth, F. Mueller, M. Schulz, and B. R. De Supinski, "Scalable compression and replay of communication traces in massively parallel environments," IEEE, 2007, pp. 1-11.
[52] S. Krishnamoorthy and K. Agarwal, "Scalable communication trace compression," in Proceedings of the 2010 10th IEEE/ACM International Conference on Cluster, Cloud and Grid Computing. IEEE Computer Society, 2010, pp. 408-417.
[53] A. Knüpfer and W. E. Nagel, "Construction and compression of complete call graphs for post-mortem program trace analysis," IEEE, 2005, pp. 165-172.
[54] D. C. Arnold, D. H. Ahn, B. R. De Supinski, G. L. Lee, B. P. Miller, and M. Schulz, "Stack trace analysis for large scale debugging," IEEE, 2007, pp. 1-10.
[55] C. January, J. Byrd, X. Oró, and M. O'Connor, "Allinea MAP: Adding energy and OpenMP profiling without increasing overhead," in Tools for High Performance Computing 2014. Springer, 2015, pp. 25-35.
[56] S. Kaufmann and B. Homer, "CrayPat - Cray X1 performance analysis tool," Cray User Group (May 2003), 2003.
[57] H. Wang, J. Zhai, X. Tang, B. Yu, X. Ma, and W. Chen, "Spindle: Informed memory access monitoring," 2018, pp. 561-574.
[58] M. Weber, R. Brendel, T. Hilbrich, K. Mohror, M. Schulz, and H. Brunst, "Structural clustering: A new approach to support performance analysis at scale," IEEE, 2016, pp. 484-493.
[59] I. Laguna, D. H. Ahn, B. R. de Supinski, T. Gamblin, G. L. Lee, M. Schulz, S. Bagchi, M. Kulkarni, B. Zhou, Z. Chen et al., "Debugging high-performance computing applications at massive scales," Communications of the ACM, vol. 58, no. 9, pp. 72-81, 2015.
[60] B. Zhou, M. Kulkarni, and S. Bagchi, "Vrisha: Using scaling properties of parallel programs for bug detection and localization," in Proceedings of the 20th International Symposium on High Performance Distributed Computing. ACM, 2011, pp. 85-96.
[61] I. Laguna, T. Gamblin, B. R. de Supinski, S. Bagchi, G. Bronevetsky, D. H. Ahn, M. Schulz, and B. Rountree, "Large scale debugging of parallel tasks with AutomaDeD," in Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis. ACM, 2011, p. 50.
[62] S. Mitra, I. Laguna, D. H. Ahn, S. Bagchi, M. Schulz, and T. Gamblin, "Accurate application progress analysis for large-scale parallel debugging," in ACM SIGPLAN Notices, vol. 49, no. 6. ACM, 2014, pp. 193-203.
[63] C. Coarfa, J. Mellor-Crummey, N. Froyd, and Y. Dotsenko, "Scalability analysis of SPMD codes using expectations," in Proceedings of the 21st Annual International Conference on Supercomputing. ACM, 2007, pp. 13-22.
[64] D. Böhme, M. Geimer, F. Wolf, and L. Arnold, "Identifying the root causes of wait states in large-scale parallel applications," IEEE, 2010, pp. 90-100.
[65] J. Chen and R. M. Clapp, "Critical-path candidates: Scalable performance modeling for MPI workloads," IEEE, 2015, pp. 1-10.
[66] T. Yu, P. Petoumenos, V. Janjic, H. Leather, and J. Thomson, "COLAB: A collaborative multi-factor scheduler for asymmetric multicore processors," in