Offsite Autotuning Approach -- Performance Model Driven Autotuning Applied to Parallel Explicit ODE Methods
Johannes Seiferth, Matthias Korch, Thomas Rauber
Department of Computer Science, University of Bayreuth, Bayreuth, Germany
{johannes.seiferth, korch, rauber}@uni-bayreuth.de

Abstract.
Autotuning (AT) is a promising concept to minimize the often tedious manual effort of optimizing scientific applications for a specific target platform. Ideally, an AT approach can reliably identify the most efficient implementation variant(s) for a new platform or new characteristics of the input by applying suitable program transformations and analytic models. In this work, we introduce Offsite, an offline AT approach which automates this selection process at installation time by rating implementation variants based on an analytic performance model without requiring time-consuming runtime tests. From abstract multilevel YAML description languages, Offsite automatically derives optimized, platform-specific and problem-specific code of possible variants and applies the performance model to these variants. We apply Offsite to parallel numerical methods for ordinary differential equations (ODEs). In particular, we investigate tuning a specific class of explicit ODE solvers (PIRK methods) for four different initial value problems (IVPs) on three different shared-memory systems. Our experiments demonstrate that Offsite can reliably identify the set of most efficient implementation variants for different given test configurations (ODE solver, IVP, platform) and effectively handle important AT scenarios.
Keywords: Autotuning · performance modeling · description language · ODE methods · ECM performance model · shared-memory

1 Introduction

The performance of scientific applications strongly depends on the characteristics of the targeted computing platform, such as the processor design, the core topology, the cache architectures, the memory latency or the memory bandwidth. Facing the growing diversity and complexity of today's computing landscape, the task of writing and maintaining highly efficient application code is getting more and more cumbersome for software developers. An implementation variant that is highly optimized for one target platform might perform poorly on another platform, yet that same variant could again outperform all other variants on the next platform. Hence, in order to achieve high efficiency and obtain optimal performance when migrating an existing scientific application, developers need to tune and adapt the application code for each specific platform anew.
A promising concept to avoid this time-consuming, manual effort is autotuning (AT), and many different approaches have been proposed to automatically tune software [2]. AT is based on two core concepts: (i) the generation of optimized implementation variants based on program transformation and optimization techniques such as loop unrolling or loop tiling, and (ii) the selection of the most efficient variant(s) on the target platform from the set of generated variants. In general, there are (i) offline and (ii) online AT techniques. Offline AT tries to select the supposedly most efficient variant at compile or installation time without actual knowledge of the input data. These approaches are applicable for use-cases whose execution behavior does not depend on the input data. This is the case, e.g., for dense linear algebra problems, which can, i.a., be tuned offline with ATLAS [20] and PhiPAC [4]. In other fields, such as sparse linear algebra or particle codes, characteristics of the input data heavily influence the execution behavior. By choosing the best variant at runtime, when all input is known, online AT approaches such as Active Harmony [19] and ATF [13] incorporate these influences.

Selecting a suitable implementation variant from a potentially large set of available variants in a time-efficient manner is a big challenge in AT. Various techniques and search strategies have been proposed in previous works to meet this challenge [2]. A straightforward approach is the time-consuming comparison of variants by runtime tests, possibly steered by a single search strategy, such as an exhaustive search or more sophisticated mathematical optimization methods like differential evolution [6], or a combination of multiple search strategies [1]. [12] proposes a hierarchical approach that allows the use of individual search algorithms for dependent subspaces of the search space. As an alternative to runtime tests, analytic performance models can be applied to either select the most efficient variant or to reduce the number of tests required by filtering out inefficient variants beforehand. In general, two categories of performance models are distinguished: (i) black box models applying statistical methods and machine learning techniques to observed performance data like hardware metrics or measured runtimes in order to learn to predict performance behavior [16,18], and (ii) white box models such as the Roofline model [21] or the ECM performance model [9,17] that describe the interaction of hardware and code using simplified machine models. For loop kernels, the Roofline and the ECM model can be automatically constructed with the Kerncraft tool [8].
In this work, we propose Offsite, an offline AT approach that automatically identifies the most efficient implementation variant(s) during installation time based on performance predictions. These predictions stem from an analytic performance prediction methodology for explicit ODE methods proposed by [15] that uses a combined white and black box model approach based on the ECM model.

Listing 1. Timestep function of a PIRK method.

for l ← 1, ..., s do
    Y_l^(0) ← y_κ
for k ← 1, ..., m do
    for l ← 1, ..., s do
        Y_l^(k) ← y_κ + h_κ · Σ_{i=1}^{s} a_li F_i^(k-1)
            with F_i^(k-1) ← f(t_κ + c_i h_κ, Y_i^(k-1))
y_{κ+1} ← y_κ + h_κ · Σ_{i=1}^{s} b_i F_i^(m)

The main contributions of this paper are:

(i)
We develop a novel offline AT approach for shared-memory systems based on performance modelling. This approach automates the task of generating the pool of possible implementation variants using abstract description languages. For all these variants, our approach can automatically predict their performance and identify the best variant(s). Further, we integrated a database interface for collected performance data which enables the reusability of data and allows feedback from possible online AT or actual program runs to be included.

(ii) We show how to apply Offsite to an algorithm from numerical analysis with complex runtime behavior: the parallel solution of IVPs of ODEs.

(iii) We validate the accuracy and efficiency of Offsite for different test configurations and discuss its applicability to four different AT scenarios.
Section 2 details the selected example use-case (PIRK methods) and the corresponding testbed. Based on this use-case, Offsite is described in Section 3. In Section 4, we experimentally evaluate Offsite in four different AT scenarios and on three different target platforms. Section 5 concludes the paper.
2 Use-Case: PIRK Methods
As example use-case, we investigate parallel iterated Runge-Kutta (PIRK) methods. PIRK methods are part of the general class of explicit ODE methods [10] and solve an ODE system

    y'(t) = f(t, y(t)),   y(t_0) = y_0,   y ∈ R^n    (1)

by performing a series of timesteps until the end of the integration interval is reached. In each timestep, a new numerical approximation y_{κ+1} for the unknown solution y is determined by an explicit predictor-corrector process in a fixed number of substeps.

PIRK methods are an excellent candidate class for AT. Their complex four-dimensional loop structure (Lst. 1) can be modified by loop transformations, resulting in a large pool of possible implementation variants whose performance behavior potentially varies highly depending on: (i) the composition of computations and memory accesses, (ii) the number of stages of the base ODE method, (iii) the characteristics of the ODE system solved, (iv) the target hardware, (v) the compiler and the compiler flags, and (vi) the number of threads started.
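The four-dimensional loop structure of Listing 1 can be sketched in C as follows. This is a minimal illustrative sketch, not one of the generated implementation variants; the function and array names (`pirk_step`, `rhs_fn`, the Butcher table arrays `A`, `b`, `c`) are assumptions, and the smoke-test right-hand side `decay` is invented for demonstration.

```c
#include <assert.h>
#include <math.h>
#include <stddef.h>

/* Right-hand-side function type: evaluates component j of f(t, y). */
typedef double (*rhs_fn)(double t, const double *y, size_t j);

/* One PIRK timestep following Listing 1: s stages, m corrector steps,
   ODE system size n, step size h, time t; A, b, c form the Butcher table.
   Y and F are scratch arrays for the stage vectors and their evaluations. */
static void pirk_step(rhs_fn f, size_t s, size_t m, size_t n,
                      double h, double t,
                      double A[s][s], double b[s], double c[s],
                      double y[n], double Y[s][n], double F[s][n]) {
  for (size_t l = 0; l < s; ++l)        /* predictor: Y_l^(0) = y_kappa */
    for (size_t j = 0; j < n; ++j)
      Y[l][j] = y[j];
  for (size_t k = 1; k <= m; ++k) {     /* m corrector steps */
    for (size_t i = 0; i < s; ++i)      /* F_i^(k-1) = f(t + c_i h, Y_i^(k-1)) */
      for (size_t j = 0; j < n; ++j)
        F[i][j] = f(t + c[i] * h, Y[i], j);
    for (size_t l = 0; l < s; ++l)      /* Y_l^(k) = y + h * sum_i a_li F_i^(k-1) */
      for (size_t j = 0; j < n; ++j) {
        double acc = 0.0;
        for (size_t i = 0; i < s; ++i)
          acc += A[l][i] * F[i][j];
        Y[l][j] = y[j] + h * acc;
      }
  }
  for (size_t i = 0; i < s; ++i)        /* F_i^(m) from the final stage vectors */
    for (size_t j = 0; j < n; ++j)
      F[i][j] = f(t + c[i] * h, Y[i], j);
  for (size_t j = 0; j < n; ++j) {      /* y_{kappa+1} = y + h * sum_i b_i F_i^(m) */
    double acc = 0.0;
    for (size_t i = 0; i < s; ++i)
      acc += b[i] * F[i][j];
    y[j] += h * acc;
  }
}

/* Smoke test: with s = 1, A = [[0]], b = [1], c = [0] and f(t,y) = -y,
   one step reduces to y <- y * (1 - h), explicit Euler for y' = -y. */
static double decay(double t, const double *y, size_t j) {
  (void)t;
  return -y[j];
}

static double demo_step(void) {
  double A[1][1] = {{0.0}}, b[1] = {1.0}, c[1] = {0.0};
  double y[1] = {1.0}, Y[1][1], F[1][1];
  pirk_step(decay, 1, 1, 1, 0.1, 0.0, A, b, c, y, Y, F);
  return y[0];
}
```

The loop transformations discussed below (interchange, unrolling, fusion) all operate on nests of this shape.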
Table 1. Characteristics of the test set of IVPs.

IVP                      Cusp        IC             Medakzo    Wave1D
Access distance          unlimited   limited        unlimited  limited
Computational behavior   mixed       compute-bound  mixed      memory-bound

In practice, many IVPs are sparse, i.e., they only access few components of Y when evaluating function f (line 4, Lst. 1). A special case of sparse is a limited access distance d(f), where f_j only accesses the components y_{j-d(f)} to y_{j+d(f)}.

Table 2.
Characteristics of the test set of target platforms.

Name                      HSW               IVB               SKY
Micro-architecture        Haswell EP        Ivy-Bridge EP     Skylake SP
CPU                       Xeon E5-2630 v3   Xeon E5-2660 v2   Xeon Gold 6148
Clock speed
Cores                     8                 10                20
L1 cache (data)           32 kB             32 kB             32 kB
L2 cache                  256 kB            256 kB            1 MB
L3 cache (shared)         20 MB             25 MB             27.5 MB
Cache line size           64 B              64 B              64 B
Instruction throughput per cycle (double precision)
  ADD / FMA / MUL         2 / 1 / 1         1 / - / 1         4 / 2 / 4
Compiler                  icc 19.0.5        icc 19.0.4        icc 19.0.2
Test Set of Initial Value Problems

In our experiments, we consider a broad set of IVPs (Table 1) that exhibit different characteristics: (i) Cusp combines Zeeman's cusp catastrophe model for a threshold-nerve-impulse mechanism with the van der Pol oscillator [7], (ii) IC describes a traversing signal through a chain of N concatenated inverters [3], (iii) Medakzo describes the penetration of radio-labeled antibodies into a tissue infected by a tumor [11], and (iv) Wave1D describes the propagation of disturbances at a fixed speed in one direction [5].

Test Set of Target Platforms
We conducted our experiments on three different shared-memory systems (Table 2). For all experiments, the CPU clock was fixed, hyper-threading was disabled and thread binding was set with KMP_AFFINITY=granularity=fine,compact. All codes were compiled with the Intel C compiler and the compiler flags -O3, -xAVX and -fno-alias.

3 Offsite

Fig. 1. Workflow of the Offsite autotuning approach. (Stages: Tuning Scenario, Kernel Working Sets, Kernel Code Generation, Derived Impl. Variants, Communication Cost Benchmarks, Kernel Prediction, Impl. Variant Prediction, Impl. Variant Ranking, Impl. Variant Code Generation, DB, Online AT.)
In this work, we introduce the Offsite offline AT approach on the example of explicit ODE methods. Before starting a new Offsite run, the desired tuning scenario, which consists of: (i) the pool of possible implementations and program transformations, (ii) the ODE base method(s), (iii) the IVP(s), and (iv) the target platform, is defined using description languages in the YAML standard¹. From its input data, Offsite automatically handles the whole tuning workflow (Fig. 1). First, Offsite generates optimized, platform-specific and problem-specific code for all kernels and derives all possible implementation variants. Applying an analytic performance prediction methodology, the performance of each kernel is predicted for either (i) a fixed ODE system size n, if specified by the user or prescribed by the ODE², or (ii) a set of relevant ODE system sizes determined by a working set model. The performance of a variant is derived by combining the predictions of its kernels and adding an estimate of its synchronization costs. Variants are ranked by their performance to identify the most efficient variant(s). All obtained prediction and ranking data are stored in a database. For the best ranked variants, Offsite generates optimized, platform-specific and problem-specific code.

A decisive, yet cumbersome step in AT is generating optimized code. Often, there is a large pool of possible implementation variants, applicable program transformations (e.g., loop transformations) and tunable parameters (e.g., tile sizes) available. Furthermore, exploiting characteristics of the input data can enable more optimizations (e.g., constant propagation). Writing all variants by hand, however, would be tedious and error-prone, and there is demand for automation. In this work, we introduce multilevel description languages to describe implementations, ODE methods, IVPs and target platforms in an abstract way. Offsite can interpret these languages and automatically derives optimized code.
¹ YAML is a data serialization language; https://yaml.org
² There are scalable ODE systems but also ODEs with a fixed size [7].

Listing 2. ODE method description format on the example of Radau IIA(7).

stages: 4
order: 7
corrector_steps: 6
A: [["0.1130", "-0.0403", "0.0258", "-0.0099"], ..., [...]]
b: ["0.2205", ...]
c: ["0.1130 - 0.0403 + 0.0258 - 0.0099", ...]

Listing 3. IVP description format on the example of InverterChain.

components:
  first: 1
  size: n-1
  code: |
    (((U_op - %in[j]) * R - (eta * ((%in[j-1] - U_thres) * (%in[j-1] - U_thres)
      - (%in[j-1] - %in[j] - U_thres) * (%in[j-1] - %in[j] - U_thres)))) / C);
constants:
  - double R = 1.0
  - ...

The Base ODE Method of a PIRK method is characterized by its Butcher table, i.e., coefficient matrix A, weight vector b, node vector c, and a small set of properties: (i) number of stages s, (ii) order o, (iii) number of corrector steps m. Exploiting these properties can have a large impact on the efficiency of an implementation variant and should be included into the code generation in order to obtain the most efficient code. The i-loop in Listing 4, e.g., might be replaceable by a single vector operation for specific s, or zero entries in the Butcher table might allow to save computations.

Listing 2 shows the ODE method description format on the example of Radau IIA(7), which is a four-stage method of order seven applying six corrector steps per timestep. To save space, only an excerpt of the Butcher table is shown with a reduced number of digits.
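The idea of exploiting zero Butcher table entries during code generation can be sketched as a toy emitter. This is purely illustrative of the substitution step, not Offsite's actual generator; the function name `emit_aprx_body` and its interface are assumptions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Toy sketch: emit the fully unrolled body of computation C1 (cf. Listing 4)
   for a fixed stage count s, substituting the weights b[i] and skipping
   exact zero entries so that neither code nor FLOPs are generated for them.
   Returns the number of emitted statements. */
static size_t emit_aprx_body(const double *b, size_t s, FILE *out) {
  size_t emitted = 0;
  for (size_t i = 0; i < s; ++i) {
    if (b[i] == 0.0)
      continue;                       /* zero Butcher entry: drop the term */
    fprintf(out, "dy[j] += %.4f * F[%zu][j];\n", b[i], i);
    ++emitted;
  }
  return emitted;
}
```

For a weight vector with one zero entry, the emitted loop body contains one statement less than the generic s-term linear combination.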
IVPs are described in the IVP description format shown by IC (Lst. 3):

(i) components describes the n components of the IVP. Each component contains a code YAML block that describes how the function evaluation f(t_κ + c_i h_κ, Y_i^(k-1)) (l. 4, Lst. 1) will be substituted during code generation, whereby %in is a placeholder for the used input vector Y^(k-1). Adjacent components that execute the same computation can be described by a single block, whereby first denotes the first component and size specifies the total number of adjacent components handled by that particular block.

(ii) constants defines IVP-specific parameters replaced with their actual values during code generation, which might possibly enable further code optimizations. In IVP IC, e.g., a multiplication could be saved if the given electrical resistance R equals 1.

Target Platform and Compiler are described using the machine description format introduced by Kerncraft³. Its general structure is tripartite: (i) the execution architecture description, (ii) the cache and memory hierarchy description, and (iii) benchmark results of typical streaming kernels.

³ For example files, we refer to https://github.com/RRZE-HPC/kerncraft

Listing 4. Kernel template description YAML on the example of APRX.

datastructs:
  - double b[s]
  - double F[s][n]
  - double dy[n]
computations:
  C1: "dy[j] = dy[j] + b[i] * F[i][j]"
variants:
  - name: APRX_ij
    code: |
      %PRAGMA nounroll_and_jam
      %LOOP_START i s
      %LOOP_START j n
      %COMP C1
      %LOOP_END j
      %LOOP_END i
    working sets: { "(s+1)*n+s", "2*n" }
  - name: APRX_ji
    code: |
      %LOOP_START j n
      %LOOP_START i s unroll
      %COMP C1
      %LOOP_END i
      %LOOP_END j
    working sets: { "(s+1)*n+s" }

Listing 5. Code generated for kernel APRX_ji of kernel template APRX specialized on Radau IIA(7) & n = 161.

double F[4][161];                // s=4; n=161
double dy[161];                  // n=161
for (int j = 0; j < 161; ++j) {  // n=161; unrolled i; replaced b[i]
  dy[j] += 0.2205 * F[0][j];
  dy[j] += 0.3882 * F[1][j];
  dy[j] += 0.3288 * F[2][j];
  dy[j] += 0.0625 * F[3][j];
}

Listing 6. Implementation skeleton description format on the example of A.

code: |
  %COM omp_barrier
  %LOOP_START k m
  %KERNEL RHS
  %COM omp_barrier
  %KERNEL LC
  %COM omp_barrier
  %LOOP_END k
  %KERNEL RHS
  %COM omp_barrier
  %KERNEL APRX
  %KERNEL UPD

Listing 7. Code generated for a variant of impl. skeleton A using kernel APRX_ji specialized on Radau IIA(7) & n = 161.

void timestep(...) {
  for (k = 0; k < 6; ++k) {        // m=6
    // Code for kernel template RHS
    // Code for kernel template LC
  }
  // Code for kernel template RHS
  // Kernel APRX
  for (int j = 0; j < 161; ++j) {  // n=161; unrolled i; replaced b[i]
    dy[j] += 0.2205 * F[0][j];
    dy[j] += 0.3882 * F[1][j];
    dy[j] += 0.3288 * F[2][j];
    dy[j] += 0.0625 * F[3][j];
  }
  // Code for kernel template UPD
}
Radau II A(7) & n = 161. Implementation Variants of numerical algorithms are abstracted by descrip-tion languages as (i) kernel templates and (ii) implementation skeletons . Kernel Templates define basic computation kernels and possible variationsof this kernel enabled by program transformations that preserve semantic cor-rectness. Listing 4 shows the kernel template description format on the exampleof
APRX , which covers computation P si =1 b i F ( m ) i (l. 5, Lst. 1): (i) datastructs defines required data structures. (ii) computations describes the computations covered by a kernel template.Each computation corresponds to a single line of code and has an unique iden-tifier (E.g. C1 in Lst. 4). Computations can contain IVP evaluations which aremarked by keyword %RHS and are replaced by an IVP component during codegeneration (E.g. for IC by line 5 of Lst. 3). Hence, if a kernel template contains %RHS , a separate, specialized kernel version has to be generated for each IVPcomponent. Johannes Seiferth, Matthias Korch, Thomas Rauber (iii) variants contains possible kernels of a kernel template enabled by pro-gram transformations. For each kernel, its workings sets ( working sets ) and itsprogram code ( code ) are specified. A code block defines for a kernel its orderof computations and program transformations and using four keywords. Com-putations are specified by keyword %COMP whose parameter must correspondto one of the identifiers defined in the computations block (E.g. C1 in Lst. 4).For-loop statements are defined by %LOOP START and %LOOP END . Thefirst parameter of %LOOP START specifies the loop variable name, the secondthe number of loop iterations and an optional third parameter unroll indicatesthat the loop will be unrolled during code generation. Loop-specific pragmas canbe added using %PRAGMA . Implementation skeletons define processing orders of kernel templates and re-quired communication points. From skeletons, concrete implementation variantsare derived by replacing its templates with concrete kernel code. Listing 6 showsthe implementation skeleton description format on the example of skeleton A which is a realization of a PIRK method (Lst. 1) that focuses on parallelismacross the ODE system, i.e its n equations are distributed blockwise among thethreads. 
A contains a loop k over the m corrector steps, dividing each corrector step into two templates: RHS computes the IVP function evaluations (l. 5, Lst. 1), which are used to compute the linear combinations (l. 4, Lst. 1) in LC. Per corrector step, two synchronizations are needed as RHS, depending on the IVP solved, can potentially require all components of the linear combinations from the last iteration of k. After all corrector steps are computed, the next approximation y_{κ+1} is calculated by templates APRX and UPD (l. 6, Lst. 1).

Four keywords suffice to specify skeletons:

(i) %LOOP_START and %LOOP_END define for-loops.

(ii) %COM states communication operations of an implementation skeleton. Skeleton A, e.g., requires 2m + 2 barrier synchronizations.

(iii) %KERNEL specifies an executed kernel template. Its parameter must correspond to the name of a kernel template. During code generation, %KERNEL is replaced by actual kernel code (e.g., APRX in Lst. 7).
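The substitution step that turns skeleton lines like those in Listing 6 into code such as Listing 7 can be sketched as follows. This is a toy sketch of the keyword handling only, not Offsite's actual generator; the function name `expand_line` and its return-code convention are assumptions.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Toy expansion of one skeleton line: %COM omp_barrier becomes an OpenMP
   barrier, %KERNEL <name> becomes the code of that kernel template (here
   just a placeholder comment, as in Listing 7). Returns 1 for a barrier,
   2 for a kernel, 0 for anything else. */
static int expand_line(const char *line, FILE *out) {
  if (strcmp(line, "%COM omp_barrier") == 0) {
    fprintf(out, "#pragma omp barrier\n");
    return 1;
  }
  if (strncmp(line, "%KERNEL ", 8) == 0) {
    fprintf(out, "// Code for kernel template %s\n", line + 8);
    return 2;
  }
  fprintf(out, "%s\n", line);   /* %LOOP_START / %LOOP_END handled elsewhere */
  return 0;
}
```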
Offsite can automatically identify the most efficient implementation variant(s) from a pool of available variants using analytic performance modelling (Fig. 1):

(i) In a first step, Offsite automatically generates code for all kernels in a special code format processable by Kerncraft⁴. Kernel code generation (Kernel Code Generation in Fig. 1) includes specializations of the code on the target platform, IVP, ODE method and (if fixed) ODE system size n. Listing 5 shows, as an example, the code generated for kernel APRX_ji of kernel template APRX (Lst. 4) when specialized on ODE method Radau IIA(7) and n = 161. As specified in the template description, the i loop is unrolled completely. Further, Butcher table coefficients (b) and known constants (s = 4, n = 161) are substituted.

⁴ In this work, version … of the Kerncraft tool was used.

(ii)
In some tuning scenarios, the ODE system size n is not yet known during installation time. Giving predictions for all valid n values, however, is in general not feasible. By applying a working set model (Sec. 3.4), Offsite automatically determines for each kernel a set of relevant n (Kernel Working Sets, Fig. 1) for which predictions are then obtained in the next step.

(iii) Offsite automatically computes node-level runtime predictions (Sec. 3.3) for each implementation variant (Impl. Variant Prediction, Fig. 1) by adding up the kernel predictions of its kernels and adding an estimate of its communication costs (Communication Cost Benchmarks, Fig. 1), which Offsite derives from benchmark data. For each of the kernel codes generated in step (i), its kernel prediction is automatically derived by Offsite (Kernel Prediction, Fig. 1), whereby Kerncraft is used to construct the ECM model.

(iv) Using these node-level runtime predictions, Offsite ranks implementation variants by their performance (Impl. Variant Ranking, Fig. 1).

(v) From the ranking of implementation variants, Offsite automatically derives the subset Λ of the best rated variant(s), which contains all variants λ whose performance is within a user-provided maximum deviation from the best rated variant. For each variant of Λ, Offsite generates optimized, platform-specific and problem-specific code (Impl. Variant Code Generation, Fig. 1). Listing 7 shows an excerpt of the code generated for a variant of implementation skeleton A which substitutes kernel template APRX with kernel APRX_ji and was specialized on ODE method Radau IIA(7), IVP IC and n = 161.

The performance prediction methodology applied by Offsite expands on the work of [15] and comprises: (i) a node-level runtime prediction of an implementation variant and (ii) an estimate of its intra-node communication costs.
Node-level Runtime Prediction. The base of the node-level prediction is the analytic ECM (Execution-Cache-Memory) performance model. For an in-depth explanation, we refer to [17,9]. The ECM model gives an estimation of the number of CPU cycles per cache line (CL) required to execute a particular loop kernel on a multi- or many-core chip, which includes contributions from the in-core execution time T_core and the data transfer time T_data:

    T_core = max(T_OL, T_nOL),                            (2)
    T_data^L3 = T_data^L1L2 + T_data^L2L3 + T_p^L2L3.     (3)

T_core is defined as the time required to retire the instructions of a single loop iteration under the assumptions that (i) there are no loop-carried dependencies, (ii) all data are in the L1 data cache, (iii) all instructions are scheduled independently to the ports of the units, and (iv) the time to retire arithmetic instructions and load/store operations can overlap due to speculative execution depending on the target platform. Hence, the unit that takes the longest to retire its instructions determines T_core. T_data^level factors in the time required to transfer all data from its current location in the memory hierarchy to the L1 data cache and back. The single contributions of transfers between levels i and j of the memory hierarchy, T_data^ij, are determined depending on the amount of transferred CLs. Depending on the platform used, an optional latency penalty T_p^ij might be added. In (3), T_data^level is exemplarily shown for data coming from the L3 cache under the assumption that a latency penalty between L2 and L3 cache has to be factored in on the platform used.

Combining all contributions, a single-core prediction

    T_ECM^level = max(T_OL, T_nOL + T_data^level)         (4)

can be derived, whereby the overlapping capabilities of the target platform determine whether a contribution is considered overlapping or non-overlapping. Using Kerncraft, Offsite obtains ECM model predictions (4) for all kernels λ. For each kernel, the kernel runtime prediction

    φ_λ = (α_λ · β_λ) / (δ · f)                           (5)

yields the runtime in seconds of kernel λ, where α_λ is (4) computed for a specific number of running cores τ, β_λ is the number of loop iterations executed, δ is the number of data elements fitting into one CL and f is the CPU frequency. By summing up the individual kernel runtime predictions φ_λ of its basic kernels λ and adding an estimate of its communication costs t_com, the node-level runtime prediction θ_ε of an implementation variant ε is given by:

    θ_ε = Σ_λ φ_λ + t_com.                                (6)

Remark: [15] used an older
Kerncraft version that could not yet return ECM predictions for multiple core counts τ with a single run, but additionally returned the kernel's saturation point σ_λ. Hence, an extra factor min(τ, σ_λ) was needed in the denominator of (5) in their work.

Estimate of Intra-node Communication Costs.
For the implementation variants considered in this work, the costs of the required OpenMP barrier operations are estimated depending on the number of threads using linear regression.
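Equations (5) and (6) together with a linear barrier-cost model can be sketched in a few lines of C. The function names and the least-squares fit shown here are illustrative assumptions; the paper only states that linear regression over thread counts is used.

```c
#include <assert.h>
#include <math.h>
#include <stddef.h>

/* Eq. (5): kernel runtime in seconds. alpha is the ECM prediction (4) in
   cycles per cache line, beta the number of loop iterations, delta the
   number of elements per cache line, freq the CPU frequency in Hz. */
static double kernel_runtime(double alpha, double beta,
                             double delta, double freq) {
  return alpha * beta / (delta * freq);
}

/* Eq. (6): node-level runtime of a variant as the sum of its kernel
   predictions plus the communication cost estimate t_com. */
static double variant_runtime(const double *phi, size_t num_kernels,
                              double t_com) {
  double theta = t_com;
  for (size_t k = 0; k < num_kernels; ++k)
    theta += phi[k];
  return theta;
}

/* Barrier cost model t_com(p) = a + b*p fitted by ordinary least squares
   from measured (thread count, seconds) samples. */
static void fit_linear(const double *p, const double *t, size_t n,
                       double *a, double *b) {
  double sp = 0, st = 0, spp = 0, spt = 0;
  for (size_t i = 0; i < n; ++i) {
    sp += p[i]; st += t[i]; spp += p[i] * p[i]; spt += p[i] * t[i];
  }
  *b = (n * spt - sp * st) / (n * spp - sp * sp);
  *a = (st - *b * sp) / n;
}
```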
Reusability of Performance Predictions.
Prediction data (e.g., kernel runtime predictions) collected for a specific implementation variant can be reused to estimate other variants (if they also include that kernel) or to estimate other IVPs (if the kernel does not contain any IVP evaluations). In the context of AT, this is a decisive advantage compared to time-consuming runtime testing of variants, which requires running each additionally added variant as well, or running all variants anew when changing the IVP solved.
If the ODE system size n is not fixed, either by the user or by restrictions of the IVP, selecting the most efficient implementation variant(s) at installation time leads to an exhaustive search over the possibly vast space of values for n. To minimize the number of predictions required per kernel, the set of estimated n values is reduced by a model-based restriction, the working set of the kernel, which corresponds to the amount of data referenced by a kernel. We use the working sets to identify for each kernel the maximum n that still fits into the single cache levels. Using these maxima, ranges of consecutive n values for which the ECM prediction (4) stays constant can be derived⁵. The medium values of these ranges form the working set of the kernel.

4 Experimental Evaluation

We validate Offsite using the experimental test bed introduced in Section 2. In particular, we study the efficiency of Offsite in four AT scenarios when tuning four different IVPs on three different target platforms and compare the efficiency of the following AT strategies:

(i) BestVariant covers the case that the most efficient implementation variant is already known (e.g., from a previous execution) and no AT is required.
(ii) RunAll runs all variants in order to identify the most efficient variant.
(iii) OffsitePreselect5 runs an Offsite-determined subset of all variants, which contains all variants within a 5 % deviation of the best ranked variant, to identify the most efficient variant of that subset.
(iv) OffsitePreselect10 runs variants within a 10 % deviation of the best variant.
(v) RandomSelect randomly runs 20 of the total 56 variants.
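The preselection behind the OffsitePreselect5 and OffsitePreselect10 strategies can be sketched as follows. The function name `preselect` and its interface are illustrative assumptions; only the selection rule (keep every variant within a relative deviation of the best prediction) is taken from the text.

```c
#include <assert.h>
#include <stddef.h>

/* From the predicted runtimes of all n variants, keep every variant whose
   prediction lies within the relative deviation max_dev (0.05 or 0.10)
   of the best (smallest) prediction. selected[i] is set to 1 for kept
   variants; the subset size |Lambda| is returned. */
static size_t preselect(const double *pred, size_t n, double max_dev,
                        int *selected) {
  double best = pred[0];
  for (size_t i = 1; i < n; ++i)
    if (pred[i] < best)
      best = pred[i];
  size_t count = 0;
  for (size_t i = 0; i < n; ++i) {
    selected[i] = pred[i] <= best * (1.0 + max_dev);
    count += (size_t)selected[i];
  }
  return count;
}
```

Only the variants in the returned subset are then compared by runtime tests, which is where the small tuning overheads in Table 4 come from.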
Table 3 summarizes the implementation skeletons and kernel templates used in this work. In total, we consider eight skeletons from which 56 implementation variants can be derived. Each table column shows the templates required by a particular skeleton. E.g., skeleton A (Lst. 6) uses templates LC, RHS, APRX and UPD, and twelve variants can be derived from A as there are six different kernels of LC (enabled by loop interchanges, unrolls, pragmas) and two of APRX. In total, 17 different kernels can be derived from the eight kernel templates available. To predict the performance of all 56 variants, only these 17 kernels have to be estimated. Further, when obtaining predictions of all 56 variants for a different IVP, only those four templates that contain IVP evaluations, and thus their six corresponding kernels, need to be re-evaluated, while prediction data of the remaining kernels can be retrieved from the database.

⁵ The ECM prediction factors in the location of data in the memory hierarchy. As a simplified assumption, neglecting overlapping effects at cache borders, this means that as long as data locations do not change, the ECM model yields the same value for a kernel independent of the actual n.

Table 3.
Overview of the implementation variants considered.
Impl. Skeleton | Kernel Template
A kernel template marked with * contains evaluations of the IVP.

As first test scenario, we consider the case that all input is known at installation time, in particular the ODE system size n. In such cases, Offsite is applied without the working set model. Performance predictions, however, are only obtained for that particular n, and a new Offsite AT run would be required if n changes.

Table 4 compares the accuracy and efficiency of AT strategies when tuning four different IVPs on three different target platforms for n = 36,000,000 and ODE method Radau IIA(7). For the AT strategies OffsitePreselect5 and OffsitePreselect10, t_step yields the time in seconds it takes to execute a timestep using the measured best implementation variant from the subset Λ of variants λ tested by that strategy. Performance loss denotes the percent runtime deviation of that particular measured best variant from the variant selected by BestVariant (t_best). Ideally, an AT strategy correctly identifies the measured best variant and, thus, would suffer no performance loss. For an AT strategy, |Λ| yields the cardinality of subset Λ, and the percent tuning overhead of applying that strategy is defined as

    (t_tune - |Λ| · t_best) / (|Λ| · t_best) · 100,

where t_tune = Σ_{λ∈Λ} t_λ is the time required to test all variants and |Λ| · t_best is the time needed to execute the measured best variant instead.

Haswell.
AT strategy RunAll causes a significant tuning overhead for all IVPs, while OffsitePreselect5 and OffsitePreselect10 only lead to marginal overhead as the subset of tested variants is considerably smaller, while still being able to select the measured best variant for all IVPs but Wave1D.

IvyBridge. Again, RunAll leads to decisive overhead compared to both Offsite strategies, and the measured best variant is correctly identified for all IVPs. However, for IVP IC only OffsitePreselect10 finds the best variant. As IC is compute-bound (Table 1), the IVP evaluation dominates the computation time while the order of the remaining computations has only minor impact. Hence, already minor jitter can lead to a different variant being selected.

Skylake. Similar observations as on the two previous systems can be made. The overhead of both Offsite strategies is marginal compared to RunAll. For all IVPs, the measured best variant is successfully identified.
Table 4. Comparison of different AT strategies applied to four different IVPs with n = 36,000,000 and Radau IIA(7).

Haswell (8 cores)                          Cusp       IC         Medakzo    Wave1D
  BestVariant                              F*ji       F*ij       F*ji       HLCjli*ij
  OffsitePreselect5   |Λ| (tuning overhead)  3 (1%)   3 (3%)     2 (2%)     3 (5%)
                      t_step [s] (perf. loss) 1.28 (–) 0.80 (–)  1.29 (–)   1.08 (4%)
  OffsitePreselect10  |Λ| (tuning overhead)  3 (1%)   3 (3%)     3 (1%)     4 (5%)
                      t_step [s] (perf. loss) 1.28 (–) 0.80 (–)  1.29 (–)   1.08 (4%)
  RunAll              |Λ| (tuning overhead) 56 (42%)  56 (44%)   56 (20%)   56 (16%)

IvyBridge (10 cores)                       Cusp       IC         Medakzo    Wave1D
  BestVariant                              E*ji       F*ij       F*ji       F*ji
  OffsitePreselect5   |Λ| (tuning overhead)  3 (3%)   1 (1%)     3 (1%)     3 (0.4%)
                      t_step [s] (perf. loss) 1.16 (–) 0.734 (1%) 1.20 (–)  1.04 (–)
  OffsitePreselect10  |Λ| (tuning overhead)  3 (3%)   3 (3%)     3 (1%)     3 (0.4%)
                      t_step [s] (perf. loss) 1.16 (–) 0.725 (–) 1.20 (–)   1.04 (–)
  RunAll              |Λ| (tuning overhead) 56 (54%)  56 (59%)   56 (41%)   56 (44%)

Skylake (20 cores)                         Cusp       IC         Medakzo    Wave1D
  BestVariant                              E*ji       F*ji       F*ji       F*ji
  OffsitePreselect5   |Λ| (tuning overhead)  3 (1%)   3 (2%)     3 (1%)     3 (–)
                      t_step [s] (perf. loss) 0.43 (–) 0.24 (–)  0.45 (–)   0.40 (–)
  OffsitePreselect10  |Λ| (tuning overhead)  3 (1%)   3 (2%)     3 (1%)     3 (1%)
                      t_step [s] (perf. loss) 0.43 (–) 0.24 (–)  0.45 (–)   0.40 (–)
  RunAll              |Λ| (tuning overhead) 56 (69%)  56 (65%)   56 (41%)   56 (44%)

The next scenario considered is that of a still unknown ODE system size n at installation time. In these cases, the working set model is applied to determine a set of sample n values for which Offsite computes predictions and from which predictions for the whole range of possible n are derived. As this requires computing multiple performance predictions, a single Offsite run takes longer than in the previous scenario. This particular Offsite run, however, already covers all possible n, and no further run will be required when switching n at a later point.

Figures 2 and 3 show, for the single implementation variants selected as best variant by the AT strategies considered, the time per timestep of IC and Cusp on three platforms (each using their maximum number of cores). On the x-axis, n is plotted up to n = 60,000,000; the y-axis plots the time per component [s/n] needed by a specific variant to solve a timestep for Radau IIA(7).

Fig. 2. Comparison of AT strategies applied to Cusp with varying n and Radau IIA(7): (a) Haswell (8 cores), (b) IvyBridge (10 cores), (c) Skylake (20 cores).

Tuning Cusp (Fig. 2). On Haswell (Fig. 2a),
OffsitePreselect5 and
OffsitePreselect10 select the same subset of three variants independent of n. Both strategies always correctly identify the measured best variant. The same observations can be made on IvyBridge (Fig. 2b) and on Skylake (Fig. 2c), where also the same subset of three variants is selected and the measured best variant is always found.
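The per-n selection underlying this scenario can be emulated offline: given predicted times at the sampled n values, pick the predicted best variant for each sample and merge adjacent samples that agree into ranges. The data and function below are illustrative sketches, not Offsite's actual interface:

```python
# Sketch: derive best-variant ranges of n from predictions at sampled n values.
# Predicted times per variant are hypothetical values (seconds per component).

def best_variant_ranges(samples):
    """samples: list of (n, {variant: predicted_time}) sorted by n.
    Returns [(n_start, n_end, variant)] where the predicted best
    variant stays the same over the whole range."""
    ranges = []
    for n, preds in samples:
        best = min(preds, key=preds.get)
        if ranges and ranges[-1][2] == best:
            ranges[-1] = (ranges[-1][0], n, best)   # extend current range
        else:
            ranges.append((n, n, best))             # start a new range
    return ranges

samples = [
    (1_000_000, {"E_ji": 2.1e-8, "F_ji": 2.3e-8}),
    (2_000_000, {"E_ji": 2.0e-8, "F_ji": 2.2e-8}),
    (4_000_000, {"E_ji": 2.4e-8, "F_ji": 2.2e-8}),
]
print(best_variant_ranges(samples))
# -> [(1000000, 2000000, 'E_ji'), (4000000, 4000000, 'F_ji')]
```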
Tuning IC (Fig. 3). On Haswell (Fig. 3a), the same subset of one variant (for OffsitePreselect5) respectively two variants (for OffsitePreselect10) is picked for n up to 8,000,000; for larger n, both strategies select the same three variants. Except for n = 5,000,000, OffsitePreselect10 always correctly finds the measured best variant. The single variant selected by OffsitePreselect5 is slightly off for n = 1,000,000 and n = 2,000,000. IC is compute-bound (Table 1) and, thus, the IVP evaluation dominates the computation time. Hence, in particular for small n, the order of the remaining computations has only minor impact on the time, and already minor jitter can lead to a different variant being selected.

On IvyBridge (Fig. 3b), OffsitePreselect5 selects the same variant for all n, while OffsitePreselect10 adds two additional variants for larger n. OffsitePreselect10 always finds the measured best variant; OffsitePreselect5 is slightly off for n = 4,000,000 and n = 5,000,000, but the absolute time difference is only marginal.

On Skylake (Fig. 3c), the same variant is selected for n up to 1,000,000 by OffsitePreselect5 as well as by OffsitePreselect10, while for larger n two additional variants are considered. Except for n = 1,000,000, both Offsite strategies manage to always correctly identify the measured best variant. As on the two previous systems, the absolute time difference is again only marginal.

Fig. 3. Comparison of AT strategies applied to IC with varying n and Radau IIA(7): (a) Haswell (8 cores), (b) IvyBridge (10 cores), (c) Skylake (20 cores).
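Both OffsitePreselect strategies build on the same preselection idea: rank all implementation variants by their analytically predicted execution time and forward only the top-ranked ones to runtime testing. A minimal sketch of this step follows; the variant names, predicted times, and the fixed-count cutoff rule are illustrative assumptions, not Offsite's exact selection rule:

```python
# Sketch: rank variants by predicted time and preselect a small subset
# Lambda for runtime testing (the preselection idea behind the
# OffsitePreselect strategies; the cutoff rule here is illustrative).

def preselect(predictions, k):
    """predictions: {variant: predicted_time}. Returns the k variants
    with the smallest predicted times, best first."""
    return sorted(predictions, key=predictions.get)[:k]

# Hypothetical predicted times (seconds per timestep) for four variants:
preds = {"A_ij": 1.9, "E_ji": 1.2, "F_ij": 1.3, "F_ji": 1.25}
print(preselect(preds, 3))   # -> ['E_ji', 'F_ji', 'F_ij']
```

Only the variants in the returned subset are then benchmarked, which is what keeps the tuning overhead in Table 4 so small compared to RunAll.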
Offsite is capable of predicting the performance of an implementation variant for different core counts with a single AT run. In this AT scenario, we consider tuning an IVP for a fixed ODE system size n and multiple core counts. Figure 4 shows the effectiveness of different AT strategies compared to strategy RunAll when tuning IVP IC on three target platforms for n = 9,000,000 and Radau IIA(7). On the x-axis, we plot the number of cores. The y-axis plots for different AT strategies the percent performance gain Π achieved by applying that particular strategy instead of RunAll, which tests all 56 variants (t_RA). The performance gain is defined as Π = (t_RA − t_AT)/t_RA · 100, where t_AT includes the time to run the variants Λ tested by that strategy and the time to run the measured best variant from Λ an additional 56 − |Λ| times. Ideally, the bar of an AT strategy would be close to the horizontal line of BestVariant.

Haswell (Fig. 4a).
Depending on the number of cores, OffsitePreselect5 picks different subsets Λ. For core counts smaller than eight, the same variant is selected, while for eight cores two additional variants are selected. Using OffsitePreselect10, these two variants are also included for four cores. For all core counts, a significant performance gain close to BestVariant can be observed when using the Offsite strategies. The total performance gain grows with increasing number of cores, as the performance gap between best and worst variants also increases. While outperforming RunAll, RandomSelect is still far off from the maximum gain.

Fig. 4. Percent performance gain achieved by different AT strategies when tuning IVP IC for different core counts, Radau IIA(7) and n = 9,000,000: (a) Haswell, (b) IvyBridge, (c) Skylake.

IvyBridge (Fig. 4b).
OffsitePreselect5 selects the same variant for all core counts. Using OffsitePreselect10, only for 20 cores two further variants are selected. Again, a significant performance gain close to BestVariant can be observed for all core counts when using the Offsite strategies, while RandomSelect is far off from that ideal gain.

Skylake (Fig. 4c). Both Offsite strategies select the same three variants for 20 cores, while the same single variant is selected for smaller core counts. As on the two previous target platforms, both Offsite strategies are close to BestVariant, while RandomSelect is again further off.
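The gain metric Π used in Figures 4 and 5 can be computed directly from measured times. The sketch below implements the definition from above with hypothetical timings (the pool size of 56 variants is taken from the experiments; everything else is an illustrative example):

```python
# Sketch: percent performance gain of an AT strategy over RunAll,
# Pi = (t_RA - t_AT) / t_RA * 100, with hypothetical timings.

N_VARIANTS = 56  # total pool size tested by RunAll

def performance_gain(tested_times, t_run_all):
    """tested_times: measured times of the subset Lambda tried by the
    strategy. t_AT = time to test Lambda plus running the measured best
    variant from Lambda the remaining 56 - |Lambda| times."""
    t_at = sum(tested_times) + min(tested_times) * (N_VARIANTS - len(tested_times))
    return (t_run_all - t_at) / t_run_all * 100.0

# Hypothetical example: the strategy tests 3 variants; RunAll tested all 56.
tested = [0.43, 0.45, 0.52]      # measured times of the tested subset
t_run_all = 56 * 0.60            # assume 0.60 s per variant on average
print(round(performance_gain(tested, t_run_all), 1))   # -> 28.0
```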
In the last AT scenario, we consider tuning an IVP for a fixed ODE system size n for four different ODE methods. Depending on the characteristics of the ODE method, different optimizations might be applicable (for a specific number of stages s, e.g., loops over s can be replaced by a vector operation), which potentially results in varying efficiency of the same implementation variant for different ODE methods.

Figure 5 shows the effectiveness of different AT strategies when tuning IVP IC on three target platforms for n = 9,000,000 and four different ODE methods: (i) Radau IA(5) (s = 3, m = 4), (ii) Radau IIA(7) (s = 4, m = 6), (iii) Lobatto IIIC(6) (s = 4, m = 5), and (iv) Lobatto IIIC(8) (s = 5, m = 7). On the x-axis, the ODE method used is shown. The y-axis plots for each AT strategy the percent performance gain Π achieved by applying that particular strategy instead of RunAll, which tests all 56 variants. The bar of an AT strategy is ideally close to the horizontal line of BestVariant.

Haswell (Fig. 5a).
OffsitePreselect5 selects the same subset of two variants for Lobatto IIIC(6) and Radau IA(5). For Lobatto IIIC(8) and Radau IIA(7), an additional variant is selected. Using OffsitePreselect10, these three variants are selected for all ODE methods. For all ODE methods, a significant performance gain close to BestVariant can be observed when using one of the two Offsite strategies. Further, both Offsite strategies decisively outperform RandomSelect.

Fig. 5. Percent performance gain achieved by different AT strategies when tuning IVP IC for different ODE methods and n = 9,000,000: (a) Haswell (8 cores), (b) IvyBridge (10 cores), (c) Skylake (20 cores).

IvyBridge (Fig. 5b).
For all ODE methods, the same single variant is chosen when using OffsitePreselect5, while OffsitePreselect10 selects two variants for Lobatto IIIC(6) and the same three variants for Lobatto IIIC(8) and Radau IIA(7). As on Haswell, the performance gain of both Offsite strategies for all ODE methods is close to the maximum gain, while the achieved gain of RandomSelect is far off from BestVariant.

Skylake (Fig. 5c). Both Offsite AT strategies select the same subset of three variants for all ODE methods except Radau IA(5), for which only two variants are selected when using OffsitePreselect5. Again, the performance gain achieved by both Offsite strategies is close to BestVariant, while RandomSelect is further off.
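The stage-loop optimization mentioned for this scenario (replacing loops over the s stages by a vector operation) can be illustrated with a small NumPy sketch. The array shapes and names are illustrative assumptions and do not correspond to Offsite's generated code:

```python
# Sketch of the stage-loop optimization: for a fixed number of stages s,
# an inner loop over the stage systems can be collapsed into one vector
# (matrix) operation. Shapes and names are illustrative only.
import numpy as np

s, n = 4, 1000                    # stages (e.g. s = 4 for Radau IIA(7)) and system size
A = np.random.rand(s, s)          # coefficient matrix of the ODE method
F = np.random.rand(s, n)          # stage function evaluations

# Variant 1: explicit loops over the s stages.
Y_loop = np.empty((s, n))
for i in range(s):
    acc = np.zeros(n)
    for j in range(s):
        acc += A[i, j] * F[j]
    Y_loop[i] = acc

# Variant 2: the stage loops collapsed into a single matrix product.
Y_vec = A @ F

assert np.allclose(Y_loop, Y_vec)   # both variants compute the same result
```

Because s is a small compile-time constant for a given ODE method, such a transformation can change which implementation variant is fastest, which is why the best variant may differ between ODE methods.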
In this work, we have introduced the Offsite AT approach, which automates the process of identifying the most efficient implementation variant(s) from a pool of possible variants at installation time. Offsite ranks variants by their performance using analytic performance predictions. To facilitate specifying tuning scenarios, multilevel YAML description languages allow these scenarios to be described in an abstract way and enable Offsite to automatically generate optimized codes. Moreover, we have demonstrated that Offsite can reliably tune a representative class of parallel explicit ODE methods, PIRK methods, by investigating different AT scenarios and AT strategies on three different shared-memory platforms.
Our future work includes expanding Offsite to cluster systems as well as to AMD and ARM platforms. Further, we intend to extend Offsite to a combined offline-online AT approach that incorporates feedback data from previous online AT (or program runs) and to study whether these data can be used to predict the performance in scenarios with unknown input data (e.g., a new IVP).
Acknowledgments
This work is supported by the German Federal Ministry of Education and Research (BMBF) under project number 01IH16012A. Furthermore, we thank the Erlangen Regional Computing Center (RRZE) for granting access to their IvyBridge and Skylake systems.