Offsite Autotuning Approach -- Performance Model Driven Autotuning Applied to Parallel Explicit ODE Methods
Johannes Seiferth, Matthias Korch, Thomas Rauber
Department of Computer Science, University of Bayreuth, Bayreuth, Germany
{johannes.seiferth, korch, rauber}@uni-bayreuth.de

Abstract.
Autotuning (AT) is a promising concept to minimize the often tedious manual effort of optimizing scientific applications for a specific target platform. Ideally, an AT approach can reliably identify the most efficient implementation variant(s) for a new platform or new characteristics of the input by applying suitable program transformations and analytic models. In this work, we introduce Offsite, an offline AT approach which automates this selection process at installation time by rating implementation variants based on an analytic performance model without requiring time-consuming runtime tests. From abstract multilevel YAML description languages, Offsite automatically derives optimized, platform-specific and problem-specific code of possible variants and applies the performance model to these variants. We apply Offsite to parallel numerical methods for ordinary differential equations (ODEs). In particular, we investigate tuning a specific class of explicit ODE solvers (PIRK methods) for four different initial value problems (IVPs) on three different shared-memory systems. Our experiments demonstrate that Offsite can reliably identify the set of most efficient implementation variants for different given test configurations (ODE solver, IVP, platform) and effectively handle important AT scenarios.
Keywords: Autotuning · performance modeling · description language · ODE methods · ECM performance model · shared-memory

1 Introduction

The performance of scientific applications strongly depends on the characteristics of the targeted computing platform, such as the processor design, the core topology, the cache architectures, the memory latency or the memory bandwidth. Facing the growing diversity and complexity of today's computing landscape, the task of writing and maintaining highly efficient application code is getting more and more cumbersome for software developers. An implementation variant that is highly optimized for one target platform might perform poorly on another platform, yet that same variant could again outperform all other variants on the next platform. Hence, in order to achieve high efficiency and obtain optimal performance when migrating an existing scientific application, developers need to tune and adapt the application code for each specific platform anew.
A promising concept to avoid this time-consuming, manual effort is autotuning (AT), and many different approaches have been proposed to automatically tune software [2]. AT is based on two core concepts: (i) the generation of optimized implementation variants based on program transformation and optimization techniques such as loop unrolling or loop tiling, and (ii) the selection of the most efficient variant(s) on the target platform from the set of generated variants. In general, there are (i) offline and (ii) online AT techniques. Offline AT tries to select the supposedly most efficient variant at compile or installation time without actual knowledge of the input data. These approaches are applicable for use-cases whose execution behavior does not depend on the input data. This is the case, e.g., for dense linear algebra problems, which can, i.a., be tuned offline with ATLAS [20] and PhiPAC [4]. In other fields, such as sparse linear algebra or particle codes, characteristics of the input data heavily influence the execution behavior. By choosing the best variant at runtime, when all input is known, online AT approaches such as Active Harmony [19] and ATF [13] incorporate these influences.

Selecting a suitable implementation variant from a potentially large set of available variants in a time-efficient manner is a big challenge in AT. Various techniques and search strategies have been proposed in previous works to meet this challenge [2]. A straightforward approach is the time-consuming comparison of variants by runtime tests, possibly steered by a single search strategy, such as an exhaustive search or more sophisticated mathematical optimization methods like differential evolution [6], or a combination of multiple search strategies [1]. [12] proposes a hierarchical approach that allows the use of individual search algorithms for dependent subspaces of the search space. As an alternative to runtime tests, analytic performance models can be applied to either select the most efficient variant or to reduce the number of tests required by filtering out inefficient variants beforehand. In general, two categories of performance models are distinguished: (i) black box models applying statistical methods and machine learning techniques to observed performance data like hardware metrics or measured runtimes in order to learn to predict performance behavior [16,18], and (ii) white box models such as the Roofline model [21] or the ECM performance model [9,17] that describe the interaction of hardware and code using simplified machine models. For loop kernels, the Roofline and the ECM model can be automatically constructed with the Kerncraft tool [8].
In this work, we propose Offsite, an offline AT approach that automatically identifies the most efficient implementation variant(s) during installation time based on performance predictions. These predictions stem from an analytic performance prediction methodology for explicit ODE methods proposed by [15] that uses a combined white and black box model approach based on the ECM model.

Listing 1. Timestep function of a PIRK method.

for l ← 1, ..., s do
    Y_l^(0) ← y_κ
for k ← 1, ..., m do
    for l ← 1, ..., s do
        Y_l^(k) ← y_κ + h_κ · Σ_{i=1}^{s} a_li F_i^(k-1)
            with F_i^(k-1) ← f(t_κ + c_i h_κ, Y_i^(k-1))
y_{κ+1} ← y_κ + h_κ · Σ_{i=1}^{s} b_i F_i^(m)

The main contributions of this paper are:

(i)
We develop a novel offline AT approach for shared-memory systems based on performance modelling. This approach automates the task of generating the pool of possible implementation variants using abstract description languages. For all these variants, our approach can automatically predict their performance and identify the best variant(s). Further, we integrated a database interface for collected performance data which enables the reusability of data and allows feedback from possible online AT or actual program runs to be included.

(ii) We show how to apply Offsite to an algorithm from numerical analysis with complex runtime behavior: the parallel solution of IVPs of ODEs.

(iii) We validate the accuracy and efficiency of Offsite for different test configurations and discuss its applicability to four different AT scenarios.
Section 2 details the selected example use-case (PIRK methods) and the corresponding testbed. Based on this use-case, Offsite is described in Section 3. In Section 4, we experimentally evaluate Offsite in four different AT scenarios and on three different target platforms. Section 5 concludes the paper.
2 Use-Case: PIRK Methods
As example use-case, we investigate parallel iterated Runge-Kutta (PIRK) methods. PIRK methods are part of the general class of explicit ODE methods [10] and solve an ODE system

    y'(t) = f(t, y(t)),   y(t_0) = y_0,   y ∈ R^n    (1)

by performing a series of timesteps until the end of the integration interval is reached. In each timestep, a new numerical approximation y_{κ+1} for the unknown solution y is determined by an explicit predictor-corrector process in a fixed number of substeps.

PIRK methods are an excellent candidate class for AT. Their complex four-dimensional loop structure (Lst. 1) can be modified by loop transformations, resulting in a large pool of possible implementation variants whose performance behavior potentially varies highly depending on: (i) the composition of computations and memory accesses, (ii) the number of stages of the base ODE method, (iii) the characteristics of the ODE system solved, (iv) the target hardware, (v) the compiler and the compiler flags, and (vi) the number of threads started.
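The four-dimensional loop structure of Listing 1 can be sketched in C as follows. This is a minimal illustrative sketch, not one of the generated implementation variants; the function and array names (`pirk_step`, `rhs_fn`, the Butcher table arrays `A`, `b`, `c`) are assumptions, and the smoke-test right-hand side `decay` is invented for demonstration.

```c
#include <assert.h>
#include <math.h>
#include <stddef.h>

/* Right-hand-side function type: evaluates component j of f(t, y). */
typedef double (*rhs_fn)(double t, const double *y, size_t j);

/* One PIRK timestep following Listing 1: s stages, m corrector steps,
   ODE system size n, step size h, time t; A, b, c form the Butcher table.
   Y and F are scratch arrays for the stage vectors and their evaluations. */
static void pirk_step(rhs_fn f, size_t s, size_t m, size_t n,
                      double h, double t,
                      double A[s][s], double b[s], double c[s],
                      double y[n], double Y[s][n], double F[s][n]) {
  for (size_t l = 0; l < s; ++l)        /* predictor: Y_l^(0) = y_kappa */
    for (size_t j = 0; j < n; ++j)
      Y[l][j] = y[j];
  for (size_t k = 1; k <= m; ++k) {     /* m corrector steps */
    for (size_t i = 0; i < s; ++i)      /* F_i^(k-1) = f(t + c_i h, Y_i^(k-1)) */
      for (size_t j = 0; j < n; ++j)
        F[i][j] = f(t + c[i] * h, Y[i], j);
    for (size_t l = 0; l < s; ++l)      /* Y_l^(k) = y + h * sum_i a_li F_i^(k-1) */
      for (size_t j = 0; j < n; ++j) {
        double acc = 0.0;
        for (size_t i = 0; i < s; ++i)
          acc += A[l][i] * F[i][j];
        Y[l][j] = y[j] + h * acc;
      }
  }
  for (size_t i = 0; i < s; ++i)        /* F_i^(m) from the final stage vectors */
    for (size_t j = 0; j < n; ++j)
      F[i][j] = f(t + c[i] * h, Y[i], j);
  for (size_t j = 0; j < n; ++j) {      /* y_{kappa+1} = y + h * sum_i b_i F_i^(m) */
    double acc = 0.0;
    for (size_t i = 0; i < s; ++i)
      acc += b[i] * F[i][j];
    y[j] += h * acc;
  }
}

/* Smoke test: with s = 1, A = [[0]], b = [1], c = [0] and f(t,y) = -y,
   one step reduces to y <- y * (1 - h), explicit Euler for y' = -y. */
static double decay(double t, const double *y, size_t j) {
  (void)t;
  return -y[j];
}

static double demo_step(void) {
  double A[1][1] = {{0.0}}, b[1] = {1.0}, c[1] = {0.0};
  double y[1] = {1.0}, Y[1][1], F[1][1];
  pirk_step(decay, 1, 1, 1, 0.1, 0.0, A, b, c, y, Y, F);
  return y[0];
}
```

The loop transformations discussed below (interchange, unrolling, fusion) all operate on nests of this shape.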
Table 1. Characteristics of the test set of IVPs.

IVP                      Cusp        IC             Medakzo    Wave1D
Access distance          unlimited   limited        unlimited  limited
Computational behavior   mixed       compute-bound  mixed      memory-bound

In practice, many IVPs are sparse, i.e., they only access few components of Y when evaluating function f (line 4, Lst. 1). A special case of sparse is a limited access distance d(f), where f_j only accesses the components y_{j-d(f)} to y_{j+d(f)}.

Table 2.
Characteristics of the test set of target platforms.

Name                      HSW               IVB               SKY
Micro-architecture        Haswell EP        Ivy-Bridge EP     Skylake SP
CPU                       Xeon E5-2630 v3   Xeon E5-2660 v2   Xeon Gold 6148
Clock speed
Cores                     8                 10                20
L1 cache (data)           32 kB             32 kB             32 kB
L2 cache                  256 kB            256 kB            1 MB
L3 cache (shared)         20 MB             25 MB             27.5 MB
Cache line size           64 B              64 B              64 B
Instruction throughput per cycle (double precision)
  ADD / FMA / MUL         2 / 1 / 1         1 / - / 1         4 / 2 / 4
Compiler                  icc 19.0.5        icc 19.0.4        icc 19.0.2
Test Set of Initial Value Problems

In our experiments, we consider a broad set of IVPs (Table 1) that exhibit different characteristics: (i) Cusp combines Zeeman's cusp catastrophe model for a threshold-nerve-impulse mechanism with the van der Pol oscillator [7], (ii) IC describes a traversing signal through a chain of N concatenated inverters [3], (iii) Medakzo describes the penetration of radio-labeled antibodies into a tissue infected by a tumor [11], and (iv) Wave1D describes the propagation of disturbances at a fixed speed in one direction [5].

Test Set of Target Platforms
We conducted our experiments on three different shared-memory systems (Table 2). For all experiments, the CPU clock was fixed, hyper-threading was disabled and thread binding was set with KMP_AFFINITY=granularity=fine,compact. All codes were compiled with the Intel C compiler and the compiler flags -O3, -xAVX and -fno-alias.

3 Offsite

Fig. 1. Workflow of the Offsite autotuning approach. (Stages: Tuning Scenario, Kernel Working Sets, Kernel Code Generation, Derived Impl. Variants, Communication Cost Benchmarks, Kernel Prediction, Impl. Variant Prediction, Impl. Variant Ranking, Impl. Variant Code Generation, DB, Online AT.)
In this work, we introduce the Offsite offline AT approach on the example of explicit ODE methods. Before starting a new Offsite run, the desired tuning scenario, which consists of: (i) the pool of possible implementations and program transformations, (ii) the ODE base method(s), (iii) the IVP(s), and (iv) the target platform, is defined using description languages in the YAML standard¹. From its input data, Offsite automatically handles the whole tuning workflow (Fig. 1). First, Offsite generates optimized, platform-specific and problem-specific code for all kernels and derives all possible implementation variants. Applying an analytic performance prediction methodology, the performance of each kernel is predicted for either (i) a fixed ODE system size n, if specified by the user or prescribed by the ODE², or (ii) a set of relevant ODE system sizes determined by a working set model. The performance of a variant is derived by combining the predictions of its kernels and adding an estimate of its synchronization costs. Variants are ranked by their performance to identify the most efficient variant(s). All obtained prediction and ranking data are stored in a database. For the best ranked variants, Offsite generates optimized, platform-specific and problem-specific code.

A decisive, yet cumbersome step in AT is generating optimized code. Often, there is a large pool of possible implementation variants, applicable program transformations (e.g., loop transformations) and tunable parameters (e.g., tile sizes) available. Furthermore, exploiting characteristics of the input data can enable more optimizations (e.g., constant propagation). Writing all variants by hand, however, would be tedious and error-prone, and there is demand for automation. In this work, we introduce multilevel description languages to describe implementations, ODE methods, IVPs and target platforms in an abstract way. Offsite can interpret these languages and automatically derives optimized code.
¹ YAML is a data serialization language; https://yaml.org
² There are scalable ODE systems but also ODEs with a fixed size [7].

Listing 2. ODE method description format on the example of Radau IIA(7).

stages: 4
order: 7
corrector_steps: 6
A: [["0.1130", "-0.0403", "0.0258", "-0.0099"], ..., [...]]
b: ["0.2205", ...]
c: ["0.1130 - 0.0403 + 0.0258 - 0.0099", ...]

Listing 3. IVP description format on the example of InverterChain.

components:
  first: 1
  size: n-1
  code: |
    (((U_op - %in[j]) * R - (eta * ((%in[j-1] - U_thres) * (%in[j-1] - U_thres)
      - (%in[j-1] - %in[j] - U_thres) * (%in[j-1] - %in[j] - U_thres)))) / C);
constants:
  - double R = 1.0
  - ...

The Base ODE Method of a PIRK method is characterized by its Butcher table, i.e., coefficient matrix A, weight vector b, node vector c, and a small set of properties: (i) number of stages s, (ii) order o, (iii) number of corrector steps m. Exploiting these properties can have a large impact on the efficiency of an implementation variant and should be included into the code generation in order to obtain the most efficient code. The i-loop in Listing 4, e.g., might be replaceable by a single vector operation for specific s, or zero entries in the Butcher table might allow to save computations.

Listing 2 shows the ODE method description format on the example of Radau IIA(7), which is a four-stage method of order seven applying six corrector steps per timestep. To save space, only an excerpt of the Butcher table is shown with a reduced number of digits.
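The idea of exploiting zero Butcher table entries during code generation can be sketched as a toy emitter. This is purely illustrative of the substitution step, not Offsite's actual generator; the function name `emit_aprx_body` and its interface are assumptions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Toy sketch: emit the fully unrolled body of computation C1 (cf. Listing 4)
   for a fixed stage count s, substituting the weights b[i] and skipping
   exact zero entries so that neither code nor FLOPs are generated for them.
   Returns the number of emitted statements. */
static size_t emit_aprx_body(const double *b, size_t s, FILE *out) {
  size_t emitted = 0;
  for (size_t i = 0; i < s; ++i) {
    if (b[i] == 0.0)
      continue;                       /* zero Butcher entry: drop the term */
    fprintf(out, "dy[j] += %.4f * F[%zu][j];\n", b[i], i);
    ++emitted;
  }
  return emitted;
}
```

For a weight vector with one zero entry, the emitted loop body contains one statement less than the generic s-term linear combination.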
IVPs are described in the IVP description format shown by IC (Lst. 3):

(i) components describes the n components of the IVP. Each component contains a code YAML block that describes how the function evaluation f(t_κ + c_i h_κ, Y_i^(k-1)) (l. 4, Lst. 1) will be substituted during code generation, whereby %in is a placeholder for the used input vector Y^(k-1). Adjacent components that execute the same computation can be described by a single block, whereby first denotes the first component and size specifies the total number of adjacent components handled by that particular block.

(ii) constants defines IVP-specific parameters replaced with their actual values during code generation, which might possibly enable further code optimizations. In IVP IC, e.g., a multiplication could be saved if the given electrical resistance R equals 1.

Target Platform and Compiler are described using the machine description format introduced by Kerncraft³. Its general structure is tripartite: (i) the execution architecture description, (ii) the cache and memory hierarchy description, and (iii) benchmark results of typical streaming kernels.

³ For example files, we refer to https://github.com/RRZE-HPC/kerncraft

Listing 4. Kernel template description YAML on the example of APRX.

datastructs:
  - double b[s]
  - double F[s][n]
  - double dy[n]
computations:
  C1: "dy[j] = dy[j] + b[i] * F[i][j]"
variants:
  - name: APRX_ij
    code: |
      %PRAGMA nounroll_and_jam
      %LOOP_START i s
      %LOOP_START j n
      %COMP C1
      %LOOP_END j
      %LOOP_END i
    working sets: { "(s+1)*n+s", "2*n" }
  - name: APRX_ji
    code: |
      %LOOP_START j n
      %LOOP_START i s unroll
      %COMP C1
      %LOOP_END i
      %LOOP_END j
    working sets: { "(s+1)*n+s" }

Listing 5. Code generated for kernel APRX_ji of kernel template APRX specialized on Radau IIA(7) & n = 161.

double F[4][161];                // s=4; n=161
double dy[161];                  // n=161
for (int j = 0; j < 161; ++j) {  // n=161; unrolled i; replaced b[i]
  dy[j] += 0.2205 * F[0][j];
  dy[j] += 0.3882 * F[1][j];
  dy[j] += 0.3288 * F[2][j];
  dy[j] += 0.0625 * F[3][j];
}

Listing 6. Implementation skeleton description format on the example of A.

code: |
  %COM omp_barrier
  %LOOP_START k m
  %KERNEL RHS
  %COM omp_barrier
  %KERNEL LC
  %COM omp_barrier
  %LOOP_END k
  %KERNEL RHS
  %COM omp_barrier
  %KERNEL APRX
  %KERNEL UPD

Listing 7. Code generated for a variant of impl. skeleton A using kernel APRX_ji specialized on Radau IIA(7) & n = 161.

void timestep(...) {
  for (k = 0; k < 6; ++k) {        // m=6
    // Code for kernel template RHS
    // Code for kernel template LC
  }
  // Code for kernel template RHS
  // Kernel APRX
  for (int j = 0; j < 161; ++j) {  // n=161; unrolled i; replaced b[i]
    dy[j] += 0.2205 * F[0][j];
    dy[j] += 0.3882 * F[1][j];
    dy[j] += 0.3288 * F[2][j];
    dy[j] += 0.0625 * F[3][j];
  }
  // Code for kernel template UPD
}
Radau II A(7) & n = 161. Implementation Variants of numerical algorithms are abstracted by descrip-tion languages as (i) kernel templates and (ii) implementation skeletons . Kernel Templates define basic computation kernels and possible variationsof this kernel enabled by program transformations that preserve semantic cor-rectness. Listing 4 shows the kernel template description format on the exampleof
APRX , which covers computation P si =1 b i F ( m ) i (l. 5, Lst. 1): (i) datastructs defines required data structures. (ii) computations describes the computations covered by a kernel template.Each computation corresponds to a single line of code and has an unique iden-tifier (E.g. C1 in Lst. 4). Computations can contain IVP evaluations which aremarked by keyword %RHS and are replaced by an IVP component during codegeneration (E.g. for IC by line 5 of Lst. 3). Hence, if a kernel template contains %RHS , a separate, specialized kernel version has to be generated for each IVPcomponent. Johannes Seiferth, Matthias Korch, Thomas Rauber (iii) variants contains possible kernels of a kernel template enabled by pro-gram transformations. For each kernel, its workings sets ( working sets ) and itsprogram code ( code ) are specified. A code block defines for a kernel its orderof computations and program transformations and using four keywords. Com-putations are specified by keyword %COMP whose parameter must correspondto one of the identifiers defined in the computations block (E.g. C1 in Lst. 4).For-loop statements are defined by %LOOP START and %LOOP END . Thefirst parameter of %LOOP START specifies the loop variable name, the secondthe number of loop iterations and an optional third parameter unroll indicatesthat the loop will be unrolled during code generation. Loop-specific pragmas canbe added using %PRAGMA . Implementation skeletons define processing orders of kernel templates and re-quired communication points. From skeletons, concrete implementation variantsare derived by replacing its templates with concrete kernel code. Listing 6 showsthe implementation skeleton description format on the example of skeleton A which is a realization of a PIRK method (Lst. 1) that focuses on parallelismacross the ODE system, i.e its n equations are distributed blockwise among thethreads. 
A contains a loop k over the m corrector steps, dividing each corrector step into two templates: RHS computes the IVP function evaluations (l. 5, Lst. 1), which are used to compute the linear combinations (l. 4, Lst. 1) in LC. Per corrector step, two synchronizations are needed as RHS, depending on the IVP solved, can potentially require all components of the linear combinations from the last iteration of k. After all corrector steps are computed, the next approximation y_{κ+1} is calculated by templates APRX and UPD (l. 6, Lst. 1).

Four keywords suffice to specify skeletons:

(i) %LOOP_START and %LOOP_END define for-loops.

(ii) %COM states communication operations of an implementation skeleton. Skeleton A, e.g., requires 2m + 2 barrier synchronizations.

(iii) %KERNEL specifies an executed kernel template. Its parameter must correspond to the name of a kernel template. During code generation, %KERNEL is replaced by actual kernel code (e.g., APRX in Lst. 7).
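The substitution step that turns skeleton lines like those in Listing 6 into code such as Listing 7 can be sketched as follows. This is a toy sketch of the keyword handling only, not Offsite's actual generator; the function name `expand_line` and its return-code convention are assumptions.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Toy expansion of one skeleton line: %COM omp_barrier becomes an OpenMP
   barrier, %KERNEL <name> becomes the code of that kernel template (here
   just a placeholder comment, as in Listing 7). Returns 1 for a barrier,
   2 for a kernel, 0 for anything else. */
static int expand_line(const char *line, FILE *out) {
  if (strcmp(line, "%COM omp_barrier") == 0) {
    fprintf(out, "#pragma omp barrier\n");
    return 1;
  }
  if (strncmp(line, "%KERNEL ", 8) == 0) {
    fprintf(out, "// Code for kernel template %s\n", line + 8);
    return 2;
  }
  fprintf(out, "%s\n", line);   /* %LOOP_START / %LOOP_END handled elsewhere */
  return 0;
}
```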
Offsite can automatically identify the most efficient implementation variant(s) from a pool of available variants using analytic performance modelling (Fig. 1):

(i) In a first step, Offsite automatically generates code for all kernels in a special code format processable by Kerncraft⁴. Kernel code generation (Kernel Code Generation in Fig. 1) includes specializations of the code on the target platform, IVP, ODE method and (if fixed) ODE system size n. Listing 5 shows, as an example, the code generated for kernel APRX_ji of kernel template APRX (Lst. 4) when specialized on ODE method Radau IIA(7) and n = 161. As specified in the template description, the i loop is unrolled completely. Further, Butcher table coefficients (b) and known constants (s = 4, n = 161) are substituted.

⁴ In this work, version … of the Kerncraft tool was used.

(ii)
In some tuning scenarios, the ODE system size n is not yet known during installation time. Giving predictions for all valid n values, however, is in general not feasible. By applying a working set model (Sec. 3.4), Offsite automatically determines for each kernel a set of relevant n (Kernel Working Sets, Fig. 1) for which predictions are then obtained in the next step.

(iii) Offsite automatically computes node-level runtime predictions (Sec. 3.3) for each implementation variant (Impl. Variant Prediction, Fig. 1) by adding up the kernel predictions of its kernels and adding an estimate of its communication costs (Communication Cost Benchmarks, Fig. 1), which Offsite derives from benchmark data. For each of the kernel codes generated in step (i), its kernel prediction is automatically derived by Offsite (Kernel Prediction, Fig. 1), whereby Kerncraft is used to construct the ECM model.

(iv) Using these node-level runtime predictions, Offsite ranks implementation variants by their performance (Impl. Variant Ranking, Fig. 1).

(v) From the ranking of implementation variants, Offsite automatically derives the subset Λ of the best rated variant(s), which contains all variants λ whose performance is within a user-provided maximum deviation from the best rated variant. For each variant of Λ, Offsite generates optimized, platform-specific and problem-specific code (Impl. Variant Code Generation, Fig. 1). Listing 7 shows an excerpt of the code generated for a variant of implementation skeleton A which substitutes kernel template APRX with kernel APRX_ji and was specialized on ODE method Radau IIA(7), IVP IC and n = 161.

The performance prediction methodology applied by Offsite expands on the work of [15] and comprises: (i) a node-level runtime prediction of an implementation variant and (ii) an estimate of its intra-node communication costs.
Node-level Runtime Prediction. The base of the node-level prediction is the analytic ECM (Execution-Cache-Memory) performance model. For an in-depth explanation, we refer to [17,9]. The ECM model gives an estimation of the number of CPU cycles per cache line (CL) required to execute a particular loop kernel on a multi- or many-core chip, which includes contributions from the in-core execution time T_core and the data transfer time T_data:

    T_core = max(T_OL, T_nOL),                            (2)
    T_data^L3 = T_data^L1L2 + T_data^L2L3 + T_p^L2L3.     (3)

T_core is defined as the time required to retire the instructions of a single loop iteration under the assumptions that (i) there are no loop-carried dependencies, (ii) all data are in the L1 data cache, (iii) all instructions are scheduled independently to the ports of the units, and (iv) the time to retire arithmetic instructions and load/store operations can overlap due to speculative execution depending on the target platform. Hence, the unit that takes the longest to retire its instructions determines T_core. T_data^level factors in the time required to transfer all data from its current location in the memory hierarchy to the L1 data cache and back. The single contributions of transfers between levels i and j of the memory hierarchy, T_data^ij, are determined depending on the amount of transferred CLs. Depending on the platform used, an optional latency penalty T_p^ij might be added. In (3), T_data^level is exemplarily shown for data coming from the L3 cache under the assumption that a latency penalty between L2 and L3 cache has to be factored in on the platform used.

Combining all contributions, a single-core prediction

    T_ECM^level = max(T_OL, T_nOL + T_data^level)         (4)

can be derived, whereby the overlapping capabilities of the target platform determine whether a contribution is considered overlapping or non-overlapping. Using Kerncraft, Offsite obtains ECM model predictions (4) for all kernels λ. For each kernel, the kernel runtime prediction

    φ_λ = (α_λ · β_λ) / (δ · f)                           (5)

yields the runtime in seconds of kernel λ, where α_λ is (4) computed for a specific number of running cores τ, β_λ is the number of loop iterations executed, δ is the number of data elements fitting into one CL and f is the CPU frequency. By summing up the individual kernel runtime predictions φ_λ of its basic kernels λ and adding an estimate of its communication costs t_com, the node-level runtime prediction θ_ε of an implementation variant ε is given by:

    θ_ε = Σ_λ φ_λ + t_com.                                (6)

Remark: [15] used an older
Kerncraft version that could not yet return ECM predictions for multiple core counts τ with a single run, but additionally returned the kernel's saturation point σ_λ. Hence, an extra factor min(τ, σ_λ) was needed in the denominator of (5) in their work.

Estimate of Intra-node Communication Costs.
For the implementation variants considered in this work, the costs of the required OpenMP barrier operations are estimated depending on the number of threads using linear regression.
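Equations (5) and (6) together with a linear barrier-cost model can be sketched in a few lines of C. The function names and the least-squares fit shown here are illustrative assumptions; the paper only states that linear regression over thread counts is used.

```c
#include <assert.h>
#include <math.h>
#include <stddef.h>

/* Eq. (5): kernel runtime in seconds. alpha is the ECM prediction (4) in
   cycles per cache line, beta the number of loop iterations, delta the
   number of elements per cache line, freq the CPU frequency in Hz. */
static double kernel_runtime(double alpha, double beta,
                             double delta, double freq) {
  return alpha * beta / (delta * freq);
}

/* Eq. (6): node-level runtime of a variant as the sum of its kernel
   predictions plus the communication cost estimate t_com. */
static double variant_runtime(const double *phi, size_t num_kernels,
                              double t_com) {
  double theta = t_com;
  for (size_t k = 0; k < num_kernels; ++k)
    theta += phi[k];
  return theta;
}

/* Barrier cost model t_com(p) = a + b*p fitted by ordinary least squares
   from measured (thread count, seconds) samples. */
static void fit_linear(const double *p, const double *t, size_t n,
                       double *a, double *b) {
  double sp = 0, st = 0, spp = 0, spt = 0;
  for (size_t i = 0; i < n; ++i) {
    sp += p[i]; st += t[i]; spp += p[i] * p[i]; spt += p[i] * t[i];
  }
  *b = (n * spt - sp * st) / (n * spp - sp * sp);
  *a = (st - *b * sp) / n;
}
```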
Reusability of Performance Predictions.
Prediction data (e.g., kernel runtime predictions) collected for a specific implementation variant can be reused to estimate other variants (if they also include that kernel) or to estimate other IVPs (if the kernel does not contain any IVP evaluations). In the context of AT, this is a decisive advantage compared to time-consuming runtime testing of variants, which requires running each additionally added variant as well, or running all variants anew when changing the IVP solved.
If the ODE system size n is not fixed, either by the user or by restrictions of the IVP, selecting the most efficient implementation variant(s) at installation time leads to an exhaustive search over the possibly vast space of values for n. To minimize the number of predictions required per kernel, the set of estimated n values is reduced by a model-based restriction, the working set of the kernel, which corresponds to the amount of data referenced by a kernel. We use the working sets to identify for each kernel the maximum n that still fits into the single cache levels. Using these maxima, ranges of consecutive n values for which the ECM prediction (4) stays constant can be derived⁵. The medium values of these ranges form the working set of the kernel.

4 Experimental Evaluation

We validate Offsite using the experimental test bed introduced in Section 2. In particular, we study the efficiency of Offsite in four AT scenarios when tuning four different IVPs on three different target platforms and compare the efficiency of the following AT strategies:

(i) BestVariant covers the case that the most efficient implementation variant is already known (e.g., from a previous execution) and no AT is required.
(ii) RunAll runs all variants in order to identify the most efficient variant.
(iii) OffsitePreselect5 runs an Offsite-determined subset of all variants, which contains all variants within a 5 % deviation of the best ranked variant, to identify the most efficient variant of that subset.
(iv) OffsitePreselect10 runs variants within a 10 % deviation of the best variant.
(v) RandomSelect randomly runs 20 of the total 56 variants.
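The preselection behind the OffsitePreselect5 and OffsitePreselect10 strategies can be sketched as follows. The function name `preselect` and its interface are illustrative assumptions; only the selection rule (keep every variant within a relative deviation of the best prediction) is taken from the text.

```c
#include <assert.h>
#include <stddef.h>

/* From the predicted runtimes of all n variants, keep every variant whose
   prediction lies within the relative deviation max_dev (0.05 or 0.10)
   of the best (smallest) prediction. selected[i] is set to 1 for kept
   variants; the subset size |Lambda| is returned. */
static size_t preselect(const double *pred, size_t n, double max_dev,
                        int *selected) {
  double best = pred[0];
  for (size_t i = 1; i < n; ++i)
    if (pred[i] < best)
      best = pred[i];
  size_t count = 0;
  for (size_t i = 0; i < n; ++i) {
    selected[i] = pred[i] <= best * (1.0 + max_dev);
    count += (size_t)selected[i];
  }
  return count;
}
```

Only the variants in the returned subset are then compared by runtime tests, which is where the small tuning overheads in Table 4 come from.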
Table 3 summarizes the implementation skeletons and kernel templates used in this work. In total, we consider eight skeletons from which 56 implementation variants can be derived. Each table column shows the templates required by a particular skeleton. E.g., skeleton A (Lst. 6) uses templates LC, RHS, APRX and UPD, and twelve variants can be derived from A as there are six different kernels of LC (enabled by loop interchanges, unrolls, pragmas) and two of APRX. In total, 17 different kernels can be derived from the eight kernel templates available. To predict the performance of all 56 variants, only these 17 kernels have to be estimated. Further, when obtaining predictions of all 56 variants for a different IVP, only those four templates that contain IVP evaluations, and thus their six corresponding kernels, need to be re-evaluated, while prediction data of the remaining kernels can be retrieved from the database.

⁵ The ECM prediction factors in the location of data in the memory hierarchy. As a simplified assumption, neglecting overlapping effects at cache borders, this means that as long as data locations do not change, the ECM model yields the same value for a kernel independent of the actual n.

Table 3.
Overview of the implementation variants considered.
Impl. Skeleton | Kernel Template
A kernel template marked with * contains evaluations of the IVP.

As first test scenario, we consider the case that all input is known at installation time, in particular the ODE system size n. In such cases, Offsite is applied without the working set model. Performance predictions, however, are only obtained for that particular n, and a new Offsite AT run would be required if n changes.

Table 4 compares the accuracy and efficiency of AT strategies when tuning four different IVPs on three different target platforms for n = 36,000,000 and ODE method Radau IIA(7). For the AT strategies OffsitePreselect5 and OffsitePreselect10, t_step yields the time in seconds it takes to execute a timestep using the measured best implementation variant from the subset Λ of variants λ tested by that strategy. Performance loss denotes the percent runtime deviation of that particular measured best variant from the variant selected by BestVariant (t_best). Ideally, an AT strategy correctly identifies the measured best variant and, thus, would suffer no performance loss. For an AT strategy, |Λ| yields the cardinality of subset Λ, and the percent tuning overhead of applying that strategy is defined as

    (t_tune - |Λ| · t_best) / (|Λ| · t_best) · 100,

where t_tune = Σ_{λ∈Λ} t_λ is the time required to test all variants and |Λ| · t_best is the time needed to execute the measured best variant instead.

Haswell.
AT strategy RunAll causes a significant tuning overhead for all IVPs, while OffsitePreselect5 and OffsitePreselect10 only lead to marginal overhead as the subset of tested variants is considerably smaller, while still being able to select the measured best variant for all IVPs but Wave1D.

IvyBridge. Again, RunAll leads to decisive overhead compared to both Offsite strategies, and the measured best variant is correctly identified for all IVPs. However, for IVP IC only OffsitePreselect10 finds the best variant. As IC is compute-bound (Table 1), the IVP evaluation dominates the computation time while the order of the remaining computations has only minor impact. Hence, already minor jitter can lead to a different variant being selected.

Skylake. Similar observations as on the two previous systems can be made. The overhead of both Offsite strategies is marginal compared to RunAll. For all IVPs, the measured best variant is successfully identified.
Table 4. Comparison of different AT strategies applied to four different IVPs with n = 36,000,000 and Radau IIA(7).

Haswell (8 cores)                          Cusp       IC         Medakzo    Wave1D
  BestVariant                              F*ji       F*ij       F*ji       HLCjli*ij
  OffsitePreselect5   |Λ| (tuning overhead)  3 (1%)   3 (3%)     2 (2%)     3 (5%)
                      t_step [s] (perf. loss) 1.28 (–) 0.80 (–)  1.29 (–)   1.08 (4%)
  OffsitePreselect10  |Λ| (tuning overhead)  3 (1%)   3 (3%)     3 (1%)     4 (5%)
                      t_step [s] (perf. loss) 1.28 (–) 0.80 (–)  1.29 (–)   1.08 (4%)
  RunAll              |Λ| (tuning overhead) 56 (42%)  56 (44%)   56 (20%)   56 (16%)

IvyBridge (10 cores)                       Cusp       IC         Medakzo    Wave1D
  BestVariant                              E*ji       F*ij       F*ji       F*ji
  OffsitePreselect5   |Λ| (tuning overhead)  3 (3%)   1 (1%)     3 (1%)     3 (0.4%)
                      t_step [s] (perf. loss) 1.16 (–) 0.734 (1%) 1.20 (–)  1.04 (–)
  OffsitePreselect10  |Λ| (tuning overhead)  3 (3%)   3 (3%)     3 (1%)     3 (0.4%)
                      t_step [s] (perf. loss) 1.16 (–) 0.725 (–) 1.20 (–)   1.04 (–)
  RunAll              |Λ| (tuning overhead) 56 (54%)  56 (59%)   56 (41%)   56 (44%)

Skylake (20 cores)                         Cusp       IC         Medakzo    Wave1D
  BestVariant                              E*ji       F*ji       F*ji       F*ji
  OffsitePreselect5   |Λ| (tuning overhead)  3 (1%)   3 (2%)     3 (1%)     3 (–)
                      t_step [s] (perf. loss) 0.43 (–) 0.24 (–)  0.45 (–)   0.40 (–)
  OffsitePreselect10  |Λ| (tuning overhead)  3 (1%)   3 (2%)     3 (1%)     3 (1%)
                      t_step [s] (perf. loss) 0.43 (–) 0.24 (–)  0.45 (–)   0.40 (–)
  RunAll              |Λ| (tuning overhead) 56 (69%)  56 (65%)   56 (41%)   56 (44%)

The next scenario considered is that of a still unknown ODE system size n at installation time. In these cases, the working set model is applied to determine a set of sample n values for which Offsite computes predictions and from which predictions for the whole range of possible n are derived. As this requires computing multiple performance predictions, a single Offsite run takes longer than in the previous scenario. This particular Offsite run, however, already covers all possible n, and no further run will be required when switching n at a later point.

Figures 2 and 3 show, for the single implementation variants selected as best variant by the AT strategies considered, the time per timestep of IC and Cusp on three platforms (each using their maximum number of cores). On the x-axis, n is plotted up to n = 60,000,000; the y-axis plots the time per component [s/n] needed by a specific variant to solve a timestep for Radau IIA(7).

Fig. 2. Comparison of AT strategies applied to Cusp with varying n and Radau IIA(7): (a) Haswell (8 cores), (b) IvyBridge (10 cores), (c) Skylake (20 cores).

Tuning Cusp (Fig. 2). On Haswell (Fig. 2a),
OffsitePreselect5 and
OffsitePreselect10 select the same subset of three variants independent of n. Both strategies always correctly identify the measured best variant. The same observations can be made on IvyBridge (Fig. 2b) and on Skylake (Fig. 2c), where also the same subset of three variants is selected and the measured best variant is always found.
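The per-n selection underlying this scenario can be emulated offline: given predicted times at the sampled n values, pick the predicted best variant for each sample and merge adjacent samples that agree into ranges. The data and function below are illustrative sketches, not Offsite's actual interface:

```python
# Sketch: derive best-variant ranges of n from predictions at sampled n values.
# Predicted times per variant are hypothetical values (seconds per component).

def best_variant_ranges(samples):
    """samples: list of (n, {variant: predicted_time}) sorted by n.
    Returns [(n_start, n_end, variant)] where the predicted best
    variant stays the same over the whole range."""
    ranges = []
    for n, preds in samples:
        best = min(preds, key=preds.get)
        if ranges and ranges[-1][2] == best:
            ranges[-1] = (ranges[-1][0], n, best)   # extend current range
        else:
            ranges.append((n, n, best))             # start a new range
    return ranges

samples = [
    (1_000_000, {"E_ji": 2.1e-8, "F_ji": 2.3e-8}),
    (2_000_000, {"E_ji": 2.0e-8, "F_ji": 2.2e-8}),
    (4_000_000, {"E_ji": 2.4e-8, "F_ji": 2.2e-8}),
]
print(best_variant_ranges(samples))
# -> [(1000000, 2000000, 'E_ji'), (4000000, 4000000, 'F_ji')]
```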
Tuning IC (Fig. 3). On Haswell (Fig. 3a), the same subset of one variant (for OffsitePreselect5) respectively two variants (for OffsitePreselect10) is picked for n up to 8,000,000; for larger n, both strategies select the same three variants. Except for n = 5,000,000, OffsitePreselect10 always correctly finds the measured best variant. The single variant selected by OffsitePreselect5 is slightly off for n = 1,000,000 and n = 2,000,000. IC is compute-bound (Table 1) and, thus, the IVP evaluation dominates the computation time. Hence, in particular for small n, the order of the remaining computations has only minor impact on the time, and already minor jitter can lead to a different variant being selected.

On IvyBridge (Fig. 3b), OffsitePreselect5 selects the same variant for all n, while OffsitePreselect10 adds two additional variants for larger n. OffsitePreselect10 always finds the measured best variant; OffsitePreselect5 is slightly off for n = 4,000,000 and n = 5,000,000, but the absolute time difference is only marginal.

On Skylake (Fig. 3c), the same variant is selected for n up to 1,000,000 by OffsitePreselect5 as well as by OffsitePreselect10, while for larger n two additional variants are considered. Except for n = 1,000,000, both Offsite strategies manage to always correctly identify the measured best variant. As on the two previous systems, the absolute time difference is again only marginal.

Fig. 3. Comparison of AT strategies applied to IC with varying n and Radau IIA(7): (a) Haswell (8 cores), (b) IvyBridge (10 cores), (c) Skylake (20 cores).
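Both OffsitePreselect strategies build on the same preselection idea: rank all implementation variants by their analytically predicted execution time and forward only the top-ranked ones to runtime testing. A minimal sketch of this step follows; the variant names, predicted times, and the fixed-count cutoff rule are illustrative assumptions, not Offsite's exact selection rule:

```python
# Sketch: rank variants by predicted time and preselect a small subset
# Lambda for runtime testing (the preselection idea behind the
# OffsitePreselect strategies; the cutoff rule here is illustrative).

def preselect(predictions, k):
    """predictions: {variant: predicted_time}. Returns the k variants
    with the smallest predicted times, best first."""
    return sorted(predictions, key=predictions.get)[:k]

# Hypothetical predicted times (seconds per timestep) for four variants:
preds = {"A_ij": 1.9, "E_ji": 1.2, "F_ij": 1.3, "F_ji": 1.25}
print(preselect(preds, 3))   # -> ['E_ji', 'F_ji', 'F_ij']
```

Only the variants in the returned subset are then benchmarked, which is what keeps the tuning overhead in Table 4 so small compared to RunAll.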
Offsite is capable of predicting the performance of an implementation variant for different core counts with a single AT run. In this AT scenario, we consider tuning an IVP for a fixed ODE system size n and multiple core counts. Figure 4 shows the effectiveness of different AT strategies compared to strategy RunAll when tuning IVP IC on three target platforms for n = 9,000,000 and Radau IIA(7). On the x-axis, we plot the number of cores. The y-axis plots for different AT strategies the percent performance gain Π achieved by applying that particular strategy instead of RunAll, which tests all 56 variants (t_RA). The performance gain is defined as Π = (t_RA − t_AT)/t_RA · 100, where t_AT includes the time to run the variants Λ tested by that strategy and the time to run the measured best variant from Λ an additional 56 − |Λ| times. Ideally, the bar of an AT strategy would be close to the horizontal line of BestVariant.

Haswell (Fig. 4a).
Depending on the number of cores, OffsitePreselect5 picks different subsets Λ. For core counts smaller than eight, the same variant is selected, while for eight cores two additional variants are selected. Using OffsitePreselect10, these two variants are also included for four cores. For all core counts, a significant performance gain close to BestVariant can be observed when using the Offsite strategies. The total performance gain grows with increasing number of cores, as the performance gap between best and worst variants also increases. While outperforming RunAll, RandomSelect is still far off from the maximum gain.

Fig. 4. Percent performance gain achieved by different AT strategies when tuning IVP IC for different core counts, Radau IIA(7) and n = 9,000,000: (a) Haswell, (b) IvyBridge, (c) Skylake.

IvyBridge (Fig. 4b).
OffsitePreselect5 selects the same variant for all core counts. Using OffsitePreselect10, only for 20 cores two further variants are selected. Again, a significant performance gain close to BestVariant can be observed for all core counts when using the Offsite strategies, while RandomSelect is far off from that ideal gain.

Skylake (Fig. 4c). Both Offsite strategies select the same three variants for 20 cores, while the same single variant is selected for smaller core counts. As on the two previous target platforms, both Offsite strategies are close to BestVariant, while RandomSelect is again further off.
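The gain metric Π used in Figures 4 and 5 can be computed directly from measured times. The sketch below implements the definition from above with hypothetical timings (the pool size of 56 variants is taken from the experiments; everything else is an illustrative example):

```python
# Sketch: percent performance gain of an AT strategy over RunAll,
# Pi = (t_RA - t_AT) / t_RA * 100, with hypothetical timings.

N_VARIANTS = 56  # total pool size tested by RunAll

def performance_gain(tested_times, t_run_all):
    """tested_times: measured times of the subset Lambda tried by the
    strategy. t_AT = time to test Lambda plus running the measured best
    variant from Lambda the remaining 56 - |Lambda| times."""
    t_at = sum(tested_times) + min(tested_times) * (N_VARIANTS - len(tested_times))
    return (t_run_all - t_at) / t_run_all * 100.0

# Hypothetical example: the strategy tests 3 variants; RunAll tested all 56.
tested = [0.43, 0.45, 0.52]      # measured times of the tested subset
t_run_all = 56 * 0.60            # assume 0.60 s per variant on average
print(round(performance_gain(tested, t_run_all), 1))   # -> 28.0
```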
In the last AT scenario, we consider tuning an IVP for a fixed ODE system size n for four different ODE methods. Depending on the characteristics of the ODE method, different optimizations might be applicable (for a specific number of stages s, e.g., loops over s can be replaced by a vector operation), which potentially results in varying efficiency of the same implementation variant for different ODE methods.

Figure 5 shows the effectiveness of different AT strategies when tuning IVP IC on three target platforms for n = 9,000,000 and four different ODE methods: (i) Radau IA(5) (s = 3, m = 4), (ii) Radau IIA(7) (s = 4, m = 6), (iii) Lobatto IIIC(6) (s = 4, m = 5), and (iv) Lobatto IIIC(8) (s = 5, m = 7). On the x-axis, the ODE method used is shown. The y-axis plots for each AT strategy the percent performance gain Π achieved by applying that particular strategy instead of RunAll, which tests all 56 variants. The bar of an AT strategy is ideally close to the horizontal line of BestVariant.

Haswell (Fig. 5a).
OffsitePreselect5 selects the same subset of two variants for Lobatto IIIC(6) and Radau IA(5). For Lobatto IIIC(8) and Radau IIA(7), an additional variant is selected. Using OffsitePreselect10, these three variants are selected for all ODE methods. For all ODE methods, a significant performance gain close to BestVariant can be observed when using one of the two Offsite strategies. Further, both Offsite strategies decisively outperform RandomSelect.

Fig. 5. Percent performance gain achieved by different AT strategies when tuning IVP IC for different ODE methods and n = 9,000,000: (a) Haswell (8 cores), (b) IvyBridge (10 cores), (c) Skylake (20 cores).

IvyBridge (Fig. 5b).
For all ODE methods, the same single variant is chosen when using OffsitePreselect5, while OffsitePreselect10 selects two variants for Lobatto IIIC(6) and the same three variants for Lobatto IIIC(8) and Radau IIA(7). As on Haswell, the performance gain of both Offsite strategies for all ODE methods is close to the maximum gain, while the achieved gain of RandomSelect is far off from BestVariant.

Skylake (Fig. 5c). Both Offsite AT strategies select the same subset of three variants for all ODE methods except Radau IA(5), for which only two variants are selected when using OffsitePreselect5. Again, the performance gain achieved by both Offsite strategies is close to BestVariant, while RandomSelect is further off.
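The stage-loop optimization mentioned for this scenario (replacing loops over the s stages by a vector operation) can be illustrated with a small NumPy sketch. The array shapes and names are illustrative assumptions and do not correspond to Offsite's generated code:

```python
# Sketch of the stage-loop optimization: for a fixed number of stages s,
# an inner loop over the stage systems can be collapsed into one vector
# (matrix) operation. Shapes and names are illustrative only.
import numpy as np

s, n = 4, 1000                    # stages (e.g. s = 4 for Radau IIA(7)) and system size
A = np.random.rand(s, s)          # coefficient matrix of the ODE method
F = np.random.rand(s, n)          # stage function evaluations

# Variant 1: explicit loops over the s stages.
Y_loop = np.empty((s, n))
for i in range(s):
    acc = np.zeros(n)
    for j in range(s):
        acc += A[i, j] * F[j]
    Y_loop[i] = acc

# Variant 2: the stage loops collapsed into a single matrix product.
Y_vec = A @ F

assert np.allclose(Y_loop, Y_vec)   # both variants compute the same result
```

Because s is a small compile-time constant for a given ODE method, such a transformation can change which implementation variant is fastest, which is why the best variant may differ between ODE methods.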
In this work, we have introduced the Offsite AT approach, which automates the process of identifying the most efficient implementation variant(s) from a pool of possible variants at installation time. Offsite ranks variants by their performance using analytic performance predictions. To facilitate specifying tuning scenarios, multilevel YAML description languages allow these scenarios to be described in an abstract way and enable Offsite to automatically generate optimized codes. Moreover, we have demonstrated that Offsite can reliably tune a representative class of parallel explicit ODE methods, PIRK methods, by investigating different AT scenarios and AT strategies on three different shared-memory platforms.
Our future work includes expanding Offsite to cluster systems as well as to AMD and ARM platforms. Further, we intend to extend Offsite to a combined offline-online AT approach that incorporates feedback data from previous online AT (or program runs) and to study whether these data can be used to predict the performance in scenarios with unknown input data (e.g., a new IVP).
Acknowledgments
This work is supported by the German Federal Ministry of Education and Research (BMBF) under project number 01IH16012A. Furthermore, we thank the Erlangen Regional Computing Center (RRZE) for granting access to their IvyBridge and Skylake systems.