A mechanism for balancing accuracy and scope in cross-machine black-box GPU performance modeling
James D. Stevens, Andreas Klöckner

Abstract
The ability to model, analyze, and predict the execution time of computations is an important building block that supports numerous efforts, such as load balancing, benchmarking, job scheduling, developer-guided performance optimization, and the automation of performance tuning for high-performance, parallel applications. In today's increasingly heterogeneous computing environment, this task must be accomplished efficiently across multiple architectures, including massively parallel coprocessors like GPUs, which are increasingly prevalent in the world's fastest supercomputers. To address this challenge, we present an approach for constructing customizable, cross-machine performance models for GPU kernels, including a mechanism to automatically and symbolically gather performance-relevant kernel operation counts, a tool for formulating mathematical models using these counts, and a customizable, parameterized collection of benchmark kernels used to calibrate models to GPUs in a black-box fashion. With this approach, we empower the user to manage trade-offs between model accuracy, evaluation speed, and generalizability. A user can define their own model and customize the calibration process, making it as simple or complex as desired, and as application-targeted or general-purpose as desired. As application examples of our approach, we demonstrate both linear and nonlinear models; these examples are designed to predict execution times for multiple variants of a particular computation: two matrix-matrix multiplication variants, four Discontinuous Galerkin (DG) differentiation operation variants, and two 2-D five-point finite difference stencil variants. For each variant, we present accuracy results on GPUs from multiple vendors and hardware generations. We view this highly user-customizable approach as a response to a central question arising in GPU performance modeling: how can we model GPU performance in a cost-explanatory fashion while maintaining accuracy, evaluation speed, portability, and ease of use, an attribute we believe precludes approaches requiring manual collection of kernel or hardware statistics?
Keywords
Performance model, GPU, Microbenchmark, Code generation, Black box, OpenCL
Maximizing computational performance requires tailoring an application implementation to the target architecture. As a result, obtaining and maintaining good performance in a heterogeneous computing environment necessitates the ability to efficiently decide between multiple mathematically equivalent program variants. Being able to model, interpret, and predict the execution time of computational kernels can provide insight into factors contributing to computation cost, and doing so in an automated, architecture-independent fashion is a key step toward the automation of performance tuning for complicated, modern, vector-based, massively parallel processor architectures, including recent CPUs and GPUs.

GPUs, originally designed for rapid graphics rendering, have highly parallel single instruction, multiple thread (SIMT) architectures that make them particularly useful for data-parallel problems. Over the last decade, general-purpose GPU programming has risen in popularity. GPUs are increasingly prevalent in the world's fastest supercomputers, and are being utilized in an expanding body of applications, including machine learning and artificial intelligence. GPU programming has been facilitated by the release of general-purpose GPU programming systems, including Nvidia CUDA in 2007 and the Open Computing Language (OpenCL) in 2009 (Munshi et al. 2011; Nickolls et al. 2008).

Tailoring a performance model to a particular computation on a single hardware device may yield high accuracy. When broadening the scope of architectures and computations targeted by a model, achieving high accuracy becomes increasingly difficult. We present a mechanism for putting the user in control of this trade-off between model accuracy and generalizability with an approach for creating custom performance models that is realized on top of, though technically not dependent on, a program transformation system.

We consider this contribution a building block to provide guidance in exploring the vast search space of possible and, from the point of view of the result, equivalent program variants, by either a developer or an auto-tuning compiler. We view the models constructed within our framework as more economical alternatives to evaluating the execution time of computational kernels than, for example, using actual on-device timing runs.

University of Illinois at Urbana-Champaign
Corresponding author:
James D. Stevens; 201 N Goodwin Ave; Urbana, IL 61801. Email: [email protected]
Our system primarily targets execution on modern GPU hardware, as exposed in, for example, the CUDA or OpenCL compute abstractions. To facilitate portability and maximize ease of use, the system makes few assumptions about the internal organization of the hardware, and device-specific parameters are obtained from a black-box model-calibration process that needs to run precisely once per model per hardware device used. We demonstrate that this cross-machine, black-box, microbenchmarking approach to analytical performance modeling can predict kernel execution time well enough to determine which of multiple implementation variants will yield the shortest execution times.

We review related performance modeling work in Section 9, and discuss two recent surveys of the current GPU performance modeling landscape, Madougou et al. (2016) and Lopez-Novoa et al. (2015). Like the survey authors, we find that many existing GPU performance models predict well for a particular application or architecture but are not easily portable across machines. Most require knowledge of hardware or application characteristics, often gathered manually, and significant effort to construct and use. Compared to analytical approaches, learning and statistical techniques tend to be more hardware-flexible, but the models produced are less accessible to users from the standpoints of both design and interpretability; assumptions and limitations about model predictive power, fidelity, and range of applicable programs and hardware tend to be less clear. Madougou et al. (2016) conclude that software utilizing these methods may be difficult to use for users without good knowledge of statistical methods. None of the approaches we surveyed provide users any significant control over the model expression or (micro)benchmark design.

The following combination of factors distinguishes our work from previous GPU performance modeling work, and comprises our primary contributions:

• Our approach allows broad customization of the mathematical model not available in previous work. A user can rely on predefined generic models or define their own model, making it as simple or as complex as desired, and as application-targeted or as general-purpose as desired.
• Similarly, our approach allows for complete customization of the set of measurement computations used to compute the model parameters during the model calibration process.
• We automate the gathering of all performance-relevant kernel features used to model execution time. These features are gathered pre-compilation by examining our polyhedrally-based program representation.
• In the example models we show in Section 8, the only hardware statistic required is the sub-group size, which is 32 on all current Nvidia and AMD hardware generations. This demonstrates that a black-box approach to performance modeling can capture execution cost behavior with very limited explicit representation of hardware characteristics. (Sub-groups and other details of the OpenCL machine model are discussed in Section 1.2.)
• Models created using our approach are hardware vendor- and generation-independent, and we demonstrate performance on an AMD GPU and four generations of Nvidia GPUs.
• Models created using our approach are amenable to human understanding: through the exposed parameters and their known meanings, it becomes possible to reason about which parts of kernel execution cost are attributed to which specific operations.
• By making use of a polyhedrally-based program representation, we obtain precise counts of various units of work performed by the static-control part of a program. The counts obtained are parametric in the problem size, allowing us to amortize counting work over repeated applications of the model to the same kernels with varying problem size.
• To help identify and measure cost contributions from individual memory accesses, we introduce a code transformation that can remove unrelated operations from a computation, thereby obtaining a kernel exercising the targeted memory access.
• To help model temporal overlap, e.g., between computation and memory operations, we introduce a modeling paradigm that reflects the reduced apparent cost of the overlapped fraction of computation. We demonstrate the use of the resulting, complex model expressions within our (unmodified) black-box measurement and timing facilities to provide accurate performance modeling even in the challenging case of overlapped operations.
For the representation of programs, as well as to enable our static counting of operations, we rely on the Loopy program transformation system (Klöckner 2014, 2015). While we make use of certain capabilities available in Loopy, this is not a hard dependency, in that any system offering a given interface could assume this role. In this section, we first briefly introduce Loopy and then describe that abstract interface.

Loopy is a programming system for array computations that targets CPUs, GPUs, and other, potentially heterogeneous, compute architectures. This system keeps the mathematical intent of a computation strictly separated from computational and performance-related minutiae. To attain that goal, Loopy realizes programs as objects in a host programming language (Python in this concrete case) that can be manipulated from their initial, "clean," mathematical statement into highly device-specific, optimized versions via a broad array of transformations. From these program objects, Loopy generates and executes source code in a range of output languages, including OpenCL, the output language we use for Loopy programs demonstrated in this work.

Our methodology to automatically gather the kernel statistics underlying the kernel features used by our modeling process leverages the Loopy programming system in a number of ways:

• we express our kernels in its intermediate representation, based on a generic OpenCL/CUDA-style machine model,
• we use its program transformation vocabulary to obtain computationally different but mathematically equivalent variants of our measurement kernels,
• we use it to generate OpenCL C-level source code for the various target machines on which we evaluate our model (while Loopy is able to target a much larger range of output languages, e.g., C, OpenMP+SIMD, OpenCL, ISPC, CUDA, we limit ourselves to OpenCL in keeping with our focus on GPUs), and finally,
• we make use of Loopy's polyhedrally-based internal representation to support the automatic extraction of kernel statistics.
One of the primary design goals of Loopy, Perflex, and UIPiCK is to avoid the need to write detailed OpenCL- or CUDA-level code manually, thereby reducing development cost. We view this reduction of manual kernel construction as a valuable feature. Nonetheless, our modeling approach could be used with less reliance on Loopy, or none at all.

For example, the automated statistics-gathering piece of our work is notionally independent of our modeling process in the sense that, while it is convenient to have the ability to automatically extract the kernel features being used in our models, it is not technically necessary, and it could be achieved either by hand or in a technologically different manner. One tool that could facilitate the collection of statistics to compute feature values for a non-Loopy kernel is the Polyhedral Extraction Tool, a library for extracting a polyhedral representation from C source code (Verdoolaege and Grosser 2012).

Additionally, the process of predicting performance via model evaluation can easily be separated from the model construction and calibration process. Once a model has been calibrated and parameter values obtained using Perflex and UIPiCK, if a developer has a different technique to obtain statistics and feature values for their application kernel, or if these values are known by design, they may use the parameter values to predict performance independently of Perflex and Loopy.

To accommodate legacy code in the existing system, we would also like to highlight the capability of Loopy to ingest a dialect of Fortran 77, as described in Klöckner et al. (2016); Klöckner (2015).
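For instance, a Fortran routine can be brought into Loopy's representation along the following lines. This is a minimal sketch: the subroutine shown is a made-up example, and we assume Loopy's lp.parse_fortran entry point suffices without further annotation.

import loopy as lp

fortran_src = """
      subroutine axpy(n, a, x, y)
        implicit none
        integer n, i
        real*8 a, x(n), y(n)
        do i = 1, n
          y(i) = y(i) + a*x(i)
        end do
      end
"""

# Parse the Fortran 77 source into a Loopy program object, after which
# the usual transformations (split_iname, tag_inames, ...) apply.
prog = lp.parse_fortran(fortran_src)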
Program representation in Loopy relies on the OpenCL machine model (Munshi et al. 2011) for semantics, and typical GPU hardware mappings thereof inspire the models we use to demonstrate our approach. A very brief overview of the OpenCL machine model will introduce terminology used throughout the following discussion.

The OpenCL model considers two levels of concurrency, each explicitly exposed to the abstraction by the programmer in the form of a multi-dimensional grid. Each integer point in the grid is called a work-item. A rectangularly-indexed set of work-items forms a work-group, and a rectangularly-indexed set of work-groups forms a grid (or NDRange), which is the unit in which work is expressed to the abstraction. Additionally, a sub-group is an implementation-dependent grouping of work-items within a work-group, which we assume to approximate the effective vector width with which the architecture issues instructions. Aside from barrier synchronization across the work-items in a work-group, individual work-items are assumed to be independent.

Indices of a work-item within a work-group are termed local indices or local IDs, and indices of a work-group in the grid are termed group indices or group IDs. We will use symbols like lid(0) to denote the local ID along axis 0 and gid(1) to denote the group ID along axis 1. We assume, in keeping with implementations available in the marketplace, that work-groups are implemented by first mapping contiguous work-items along the lowest-numbered axis to adjacent SIMD lanes, subsequently to simultaneous multithreading (SMT), and finally to sequential execution. Likewise, we assume that work-groups are mapped to individual execution cores, potentially mapping multiple work-groups onto one core depending on capacity.
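Under these assumptions, the sub-group membership of a work-item follows from flattening its local index with axis 0 varying fastest. A minimal sketch of this mapping (the helper below is illustrative, not part of the framework; sub-group size 32 per the hardware we target):

def subgroup_index(lid, lsize, subgroup_size=32):
    # Flatten the local index, axis 0 fastest, mirroring the assumed
    # mapping of contiguous lid(0) values to adjacent SIMD lanes.
    flat = lid[0] + lsize[0] * (lid[1] + lsize[1] * lid[2])
    return flat // subgroup_size

# In a 16x16x1 work-group (256 work-items, 8 sub-groups), work-items
# (0,0,0) through (15,1,0) share sub-group 0:
assert subgroup_index((15, 1, 0), (16, 16, 1)) == 0
assert subgroup_index((0, 2, 0), (16, 16, 1)) == 1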
As an introduction to our modeling approach, consider the following simple, illustrative example use case. Suppose a user wants to model and predict the execution time of square matrix-matrix multiplication on a GPU. In this case, we will only predict execution times for a single program variant, employing an algorithm which divides each matrix into 16 × 16 tiles and avoids some repeated access to global memory by prefetching these tiles into local memory shared between threads before performing the multiplication and addition.
To construct an initial program representation of the matrix-matrix multiplication workload using Loopy, we first specify the mathematical intent:
knl = lp.make_kernel(
    "{[i,j,k]: 0<=i,j,k<n}",
    "c[i,j] = sum(k, a[i,k]*b[k,j])")

Without any code transformations, Loopy produces a sequential algorithm looping over each index, as in the following OpenCL code:

float acc;
for (int j = 0; j <= -1 + n; ++j)
  for (int i = 0; i <= -1 + n; ++i)
  {
    acc = 0.0f;
    for (int k = 0; k <= -1 + n; ++k)
      acc = acc + a[n*i + k] * b[n*k + j];
    c[n*i + j] = acc;
  }

To expose levels of concurrency in the form of loops that will become parallelized, we perform loop-splitting transformations:

knl = lp.split_iname(knl, "i", 16)
knl = lp.split_iname(knl, "j", 16)
knl = lp.split_iname(knl, "k", 16)
knl = lp.assume(knl, "n >= 1 and n % 16 = 0")

Each split_iname transformation divides one loop into two nested loops, with the inner loop iterating over the specified number of index values (16). Without knowledge about the value of n, Loopy would need to add several conditional statements to prevent out-of-bounds array access. We avoid these conditionals by adding the assume transformation, yielding the following OpenCL code:

float acc;
for (int j_out = 0; j_out <= ((-16 + n) / 16); ++j_out)
  for (int j_in = 0; j_in <= 15; ++j_in)
    for (int i_out = 0; i_out <= ((-16 + n) / 16); ++i_out)
      for (int i_in = 0; i_in <= 15; ++i_in)
      {
        acc = 0.0f;
        for (int k_out = 0; k_out <= ((-16 + n) / 16); ++k_out)
          for (int k_in = 0; k_in <= 15; ++k_in)
            acc = acc
              + a[n*(16*i_out + i_in) + 16*k_out + k_in]
              * b[n*(16*k_out + k_in) + 16*j_out + j_in];
        c[n*(16*i_out + i_in) + 16*j_out + j_in] = acc;
      }

To parallelize loops, we tag loop indices with OpenCL thread indices:

knl = lp.tag_inames(knl,
    "i_out:g.1, i_in:l.1, j_out:g.0, j_in:l.0")

This tag_inames transformation parallelizes loop(s) across thread indices as specified, i.e., "i_in:l.1" parallelizes the i_in loop across the lid(1) thread index. After tagging loop indices, we obtain the following OpenCL code:

float acc;
acc = 0.0f;
for (int k_out = 0; k_out <= ((-16 + n) / 16); ++k_out)
  for (int k_in = 0; k_in <= 15; ++k_in)
    acc = acc
      + a[n*(16*gid(1) + lid(1)) + 16*k_out + k_in]
      * b[n*(16*k_out + k_in) + 16*gid(0) + lid(0)];
c[n*(16*gid(1) + lid(1)) + 16*gid(0) + lid(0)] = acc;

Finally, to help amortize data motion cost, we perform prefetching transformations:

knl = lp.add_prefetch(knl, "a", ["i_in", "k_in"])
knl = lp.add_prefetch(knl, "b", ["k_in", "j_in"])

The two prefetching transformations load data from the specified array into local memory outside of the specified loop, yielding the following OpenCL kernel:

float acc;
acc = 0.0f;
__local float a_fetch[16*16];
__local float b_fetch[16*16];
for (int k_out = 0; k_out <= ((-16 + n) / 16); ++k_out)
{
  barrier(CLK_LOCAL_MEM_FENCE);
  a_fetch[16*lid(1) + lid(0)] =
    a[n*(16*gid(1) + lid(1)) + 16*k_out + lid(0)];
  b_fetch[16*lid(1) + lid(0)] =
    b[n*(16*k_out + lid(1)) + 16*gid(0) + lid(0)];
  barrier(CLK_LOCAL_MEM_FENCE);
  for (int k_in = 0; k_in <= 15; ++k_in)
    acc = acc + a_fetch[16*lid(1) + k_in] *
                b_fetch[16*k_in + lid(0)];
}
c[n*(16*gid(1) + lid(1)) + 16*gid(0) + lid(0)] = acc;

For simplicity in this example, we only predict execution times for a single program variant, so a very simple mathematical model may suffice. We model the execution time as

    t(n) ≈ p_madd · f_madd(n),    (1)

where n is the matrix width or, more generally, a parameter which statically determines the size of the loop domain and is assumed to be known at runtime; f_madd(n) is the number of multiply-add ("madd") sequences executed by the algorithm; p_madd is a hardware-dependent parameter representing the effective cost (in seconds) per madd for this program variant; and t(n) is execution time.
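Since each of the n² result entries accumulates n multiply-add pairs, f_madd(n) = n³ for this computation, and (1) is trivial to evaluate once p_madd is known. A minimal illustration in Python (the closed form is the standard matmul operation count, not output from our tooling):

def f_madd(n):
    # n*n result entries, each accumulating n multiply-add sequences.
    return n**3

def predicted_time(n, p_madd):
    # Model (1): t(n) ~ p_madd * f_madd(n).
    return p_madd * f_madd(n)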
We distinguish madd sequences from isolated multiplication or addition operations, since many GPUs provide a specialized fused multiply-add operator. We refer to quantitative kernel characteristics used in our models as kernel features. Input features, e.g., f_madd(n), appear in the model function, and output features, e.g., t(n), are produced when evaluating a model function.

Our methods for automatically producing symbolic counts like f_madd(n) in the form of piecewise quasi-polynomials will be discussed in Section 5. Now we need a value for the parameter p_madd. To determine this parameter value while treating the GPU as a black box, we run measurement computations designed to reveal the appropriate cost. In this simple example we need the effective madd cost only for this particular program variant, so we determine this cost by running the same matrix multiplication variant on various sizes of matrices, computing f_madd(n) and t(n) for each, and then calibrating our model by fitting the model function to the feature data to compute p_madd. More complex models that employ a microbenchmark approach to determine parameter values will be discussed in Section 8.

The following code fragments demonstrate how this is accomplished using our framework.

1. Define the model shown in (1) by specifying wall time on the Nvidia GTX Titan X GPU as the output feature (f_cl_wall_time_nvidia_geforce), and providing the model expression. p_f32madd represents the parameter p_madd and f_op_float32_madd specifies the input feature f_madd(n):

model = Model(
    "f_cl_wall_time_nvidia_geforce",
    "p_f32madd * f_op_float32_madd")

2. Generate measurement kernels for model calibration using our parameterized collection of kernel generators, UIPiCK, which we describe in Section 7.1. To govern which generators are used and which kernel variants are produced by each generator, we pass filtering tags to the kernel collection identifying the desired computation and describing, for example, the data type of the arrays, the work-group dimensions, whether to prefetch data into local memory, or whether to assume that work-group dimensions divide evenly into the corresponding array dimensions. In this example, we generate square matrix multiplication kernel variants:

filter_tags = [
    "matmul_sq", "dtype:float32", "prefetch:True",
    "lsize_0:16", "lsize_1:16", "groups_fit:True",
    "n:2048,2560,3072,3584"]
m_knls = KernelCollection(
    uipick.ALL_GENERATORS).generate_kernels(filter_tags)

Generator filter tags, consisting of a single value, e.g., matmul_sq, determine which generators run. This kernel collection includes all UIPiCK generators, but the matmul_sq tag filters out generators not executing a square matmul. In this case, only generators with a tag matching the user-supplied generator tag, matmul_sq, will execute, which leaves us with a single generator.

Variant filter tags consist of argument:value pairs, e.g., dtype:float32, and determine which kernel implementation variants are produced by the generators. Each generator maintains a set of allowable values for each argument and generates one kernel for each set of arguments in the Cartesian product of allowable argument value sets. By passing a variant filter tag with argument values, the user can reduce the set of allowable values for that argument. In this example, we provide a set of four values for the problem size n and a single value for each of the remaining arguments, so four measurement kernels will be generated, as sketched below.
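Conceptually, variant enumeration behaves like a Cartesian product over the (possibly user-restricted) allowable argument values. The following sketch reproduces the count of four kernels for the tags above; it illustrates the idea only and is not UIPiCK's actual implementation:

from itertools import product

allowable = {
    "dtype": ["float32"], "prefetch": [True],
    "lsize_0": [16], "lsize_1": [16], "groups_fit": [True],
    "n": [2048, 2560, 3072, 3584],
}
# One kernel variant per element of the Cartesian product.
variants = [dict(zip(allowable, combo))
            for combo in product(*allowable.values())]
assert len(variants) == 4  # only n varies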
We discuss kernel set generation and filtering further in Section 7.1.

3. Compute input and output feature values for all measurement kernels:

m_knl_feature_values = gather_feature_values(
    model.all_features(), m_knls)

Here we compute the two feature values found in our model, the madd count and the execution time, for each of the four measurement kernels. This uses our automatic kernel statistics gathering techniques described in Section 5.

4. Fit the model to the feature value data, producing parameter values:

model_param_values = fit_model(model, m_knl_feature_values)

We describe this calibration process in Section 7.2.

5. Evaluate the model to predict execution time for a kernel:

exec_time = model.eval_with_kernel(
    model_param_values, test_knl, {"n": 1024})

[Figure 1. Measured vs. modeled execution times for square tiled matrix multiplication with prefetching on the Nvidia GTX Titan X GPU. OpenCL/platform/driver info: OpenCL 1.2 CUDA 10.1.120 (430.50)]

Figure 1 displays the measured execution times and model predictions obtained in this example on an Nvidia GTX Titan X GPU. In this case, we sacrifice breadth of model applicability to achieve very accurate predictions with a very simple model by using measurement kernels that are very similar to our target computation.

We can achieve different modeling results, and further insight, by modifying our measurement kernel set. For example, if we would like to explore the component of execution time attributable to madd operations only, we can replace the matrix multiplication kernels in our measurement set with kernels designed to measure peak madd throughput by using the following filter tags:

filter_tags = [
    "flops_madd_pattern", "dtype:float32",
    "lsize_0:16", "lsize_1:16", "groups_fit:True",
    "lid_stride_0:1", "lid_stride_1:2048",
    "nelements:524288,786432,1048576,1310720",
    "m:1024,1152,1280,1408"]

We discuss further details of this process in Section 7.1. Calibrating the model with this measurement kernel set yields the result shown in Figure 2. Since the value for p_madd in this model was computed using data from measurement kernels designed to reveal peak flops rates, these results may provide insight into the portion of the computation attributable to madd operations. Examples in Section 8 will demonstrate more complex models that use large sets of microbenchmark measurement kernels to determine larger numbers of parameter values, and predict execution time for multiple kernel variants.

[Figure 2. Modeling the component of execution time attributable to madd operations for square tiled matrix multiplication with prefetching on the Nvidia GTX Titan X GPU. OpenCL/platform/driver info: OpenCL 1.2 CUDA 10.1.120 (430.50)]

There are three main components to our framework. UIPiCK, our parameterized collection of kernels, contains kernel generators that produce the customized set of measurement computations used to calibrate model parameters. A statistics gathering module, which we have added to Loopy, enables the automated extraction of kernel statistics. Perflex, our performance modeling tool, enables custom model construction from user-defined parameters and kernel features, as well as model calibration and evaluation.

Figure 3 shows the process for creating and calibrating a model. First, the user creates a model as an arithmetic expression in terms of some subset of the available Perflex features.
Second, the user provides a list of filtering tags to UIPiCK, which generates a measurement kernel set. Then Perflex gathers feature values for each kernel in the set and calibrates the model by fitting the model function to the feature value data. This produces model parameter values, which the user can then use, along with the model, to produce execution time predictions for new kernels. UIPiCK generators use Loopy to create and transform measurement kernels, and Perflex uses our statistics-gathering routines when computing kernel feature values.

We organize this contribution as follows. In Section 4, we discuss limitations of our approach and the assumptions underlying it. In Section 5, we present our procedure for automated extraction of performance-relevant kernel statistics. In Section 6, we introduce a set of kernel features derived from these statistics that form the basis of models constructed using our framework. In Section 7, we describe how UIPiCK produces a customized set of measurement kernels based on user-provided filtering tags, and how Perflex enables custom model creation and calibration. In Section 8, we evaluate the predictive power of several performance models we create using Perflex and UIPiCK on five GPUs from various hardware generations and vendors. Lastly, we compare our approach to related work and provide some conclusions and directions for future work.

The modeling system as presented has considerable generality and extensibility, as the set of expressions usable for modeling is unbounded in size, and the set of program features is user-extensible. For evaluation of our system, we focus on cost-explanatory models of performance, i.e., models that seek to explain the execution time of a kernel as a sum of costs, each of which is given by an empirically determined coefficient multiplied by an operation count for a particular type of operation. In keeping with this view, we make only very limited use of nonlinearity. Importantly, this is a feature of our evaluation and our preferred choice of model, not necessarily of the system itself. It is true, however, that the preference for such models has biased the set of kernel features that are built into the system.

We envision a number of use cases for our system. First, as shown in Section 2, it can be used to help a human user put timing measurements into perspective and understand performance characteristics of a computation on a given machine. Additionally, it can serve as a pruning heuristic for an optimization procedure that considers computationally (or even algorithmically) different but mathematically equivalent versions of a given kernel. The latter use case gives rise to our key criterion for evaluation, for which we (here, manually) produce program variants and evaluate the degree to which our model provides correct guidance for ranking the execution time of the kernels presented to it. Succeeding at this task would make the model an effective pruning strategy, as it would permit us to efficiently rule out parts of an autotuning search space without having to rely on execution of the actual program. This does not mean that the tuning process must be entirely execution-free. In fact, as we demonstrate, important accuracy gains are available if we allow for additional, on-line measurement runs that obtain calibration parameters for features not thus far encountered.

As a secondary objective, we evaluate the accuracy with which our models predict overall execution time, in terms of relative error.
In a broad overview of the literature, discussed in Section 9, we found no performance model able to consistently attain better than single-digit percentages of relative error on this task. Thus we evaluate our models to be roughly consistent with this standard and view departures from it as a sign of potential modeling issues.

Lastly, we evaluate the interpretability of our models. While, for our stated purpose, we are most interested in the relative ranking of various program variants, we choose execution time rather than ranking as our primary model output. Ranking-based approaches (e.g., Chen et al. (2018)) have seen considerable success; however, discerning contributions to their prediction output poses a considerable challenge. We only consider models that have a simple, mostly linear form in terms of potential cost contributions. We define interpretability of our models as the degree to which the calibrated model is consistent with the cost-explanatory point of view. For example, models that require negative weights are inconsistent with the notion of 'cost', as carrying out additional operations of any type should never result in a cost reduction. Taken together, these factors enhance the reliability and trustworthiness of our methods.

[Figure 3. An overview of the performance modeling process: (1) the user creates a model expression from Perflex features; (2) UIPiCK generates a measurement kernel set from filter tags; (3) Perflex gathers feature values for the measurement kernels; (4) fit_model produces model parameter values; (5) eval_with_kernel predicts an output feature, e.g., an execution time estimate, for a new kernel.]

A further aspect of ensuring reliability is a statement of the conditions under which we can expect our models to satisfy the criteria above. An important class of performance effects relates to machine utilization. For example, the peak rate of floating point operations feasible on a given machine varies with the number of cores in use, or, analogously, the number of vector lanes used within a subgroup/warp/wavefront/SIMD vector. The sizes of available on-chip state space (registers, scratchpad) may impact utilization of scheduling slots, which in turn may impact the degree to which latency may effectively be hidden. These are examples of effects that our models do not (and cannot) account for. In return, the elementary cost coefficients that our models obtain by the calibration process are readily interpretable, e.g., by comparison with the reciprocal of memory bandwidth and the peak FLOP rate.

In cases of less than full utilization, our models can still be used as long as the computation in question remains in a steady state, essentially increasing the cost of a single operation to account for less-than-optimal utilization. In cases of genuinely varying utilization, the only viable path is to shrink the modeling granularity (e.g., to a single core or SIMD lane) until the variation in utilization is no longer relevant. We note that this is not at all an uncommon assumption. The 'execution-cache-memory' (ECM) family of models (Treibig and Hager 2010) is based around similar considerations and calls this a 'speed of light' assumption. As a simplified proxy for this assumption, we enforce that nearly all of the measurement kernels used to calibrate the models demonstrated here, as well as the kernels whose execution time we model, use work-groups of size 256.
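One way to make the requirement of non-negative cost coefficients noted above operational during calibration is to restrict the least-squares fit accordingly. The sketch below uses SciPy's non-negative least squares on made-up placeholder feature and timing data; it illustrates the idea only and is not Perflex's actual fit_model implementation:

import numpy as np
from scipy.optimize import nnls

# Rows: measurement kernels. Columns: input feature values, e.g.,
# (f_op_float32_madd, f_mem_access_global_float32). Placeholder numbers.
F = np.array([[1.1e9, 2.2e8],
              [2.3e9, 4.6e8],
              [4.0e9, 8.0e8],
              [6.1e9, 1.2e9]])
t = np.array([0.012, 0.024, 0.042, 0.063])  # measured wall times (s)

# Non-negative coefficients keep each parameter interpretable as a
# per-operation cost: more work can never reduce the estimate.
p, residual_norm = nnls(F, t)
print(dict(zip(["p_f32madd", "p_f32g"], p)))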
The set of features available by default in our modeling framework is the source of a further set of limitations and assumptions that merit discussion. The majority of these features are based on the detection of a specific type of operation (e.g., a floating point operation, or a specific type of memory access) and an accounting of the number of times that the cost of the operation is incurred through repeated execution of the site of the operation. To estimate the latter quantity, we assume that the program under consideration exhibits static (i.e., non-data-dependent) control flow that is accurately represented by the polyhedrally-given loop domain. Data-dependent loop control flow could in principle be handled by allowing user-supplied 'average' trip counts, but we defer this to future work. Further, any conditionals present in the code are accounted for by summing the cost contribution of both branches, matching the cost behavior of GPUs under divergent control flow. All of our kernel generators accept tags allowing a user to specify assumptions, e.g., that the relationship between a particular index bound and the relevant work-group dimensions eliminates the need for conditionals that would otherwise be necessary to avoid out-of-bounds array access. We use these to minimize the number of conditionals in the measurement kernels and test kernels.

The cost of memory access is particularly challenging to predict. Effects that may contribute to this circumstance include the interaction of caches and locality (statement-to-statement as well as vector), and contention in the setting of banked memory. We have further observed that array sizes above a certain threshold (e.g., 1 GiB on our AMD hardware) can severely impact performance. We make no attempt to model the implementation details of each target machine's memory subsystem. Instead, we take a two-pronged approach (Section 6.1.1). For simple types of memory access whose cost we expect to generalize across programs, we offer descriptive classification by, e.g., inter-lane stride, utilization ratio, and data width. Notably, our ability to determine these is predicated on the (multi-dimensional) array subscript being quasi-affine, as they rely on polyhedrally-based reasoning. For more complex access patterns, we permit the memory access cost to be measured by executing the memory access in isolation, including its loop environment, retaining an additive accounting of cost. Our mechanism enabling this approach will be discussed further in Section 7.1.1.

Lastly, while the present implementation of the proposed system is based on OpenCL to attain vendor coverage, it would be straightforward to realize an analogous system for the Nvidia CUDA compute abstraction (Nickolls et al. 2008), particularly given that Loopy is capable of generating CUDA code.
The basic mathematical primitive underpinning our data gathering strategy is the ability to count the number of integer points in a subset of the d-dimensional integer tuples Z^d specified by affine inequalities connected in disjunctive normal form (i.e., a disjunction of conjunctions of affine inequalities). The output of this operation is a piecewise quasi-polynomial in terms of problem size parameters that may occur as part of the specification of the set of integers. E.g., the number of integer points (i,j) in a set such as {(i,j) : p ≤ i < q and p ≤ j ≤ i} is a piecewise quasi-polynomial in the parameters p and q, namely (q − p)(q − p + 1)/2 for q > p and zero otherwise. Algorithm 1 summarizes how we use this primitive to obtain per-kernel operation counts.

Algorithm 1: Determine the per-kernel count of a per-statement operation.
  for each statement S in the kernel do
    Compute the projection π_S(D) of the loop domain D onto the set of loop indices in which S resides.
    Obtain the symbolic count |π_S(D)| of integer points in the projection (a piecewise quasi-polynomial representing the number of times S will be executed).
    Count the operations (n_ops,S) occurring in a single instance of statement S (e.g., by traversing left- and right-hand-side expressions).
  end for
  Find the overall count for the desired operation as
      n_ops = Σ_{statements S} |π_S(D)| · n_ops,S.    (2)

Counting synchronizations requires additional machinery, since barrier placement is not yet determined in Loopy code without a program outline. The program outline is found automatically by a search procedure and determines the ordering of statements and the nesting of loops, which enables a subsequent procedure that determines synchronization locations. Once a program outline is obtained and barriers are placed, the counting process proceeds much as above, using the program outline information to obtain the relevant set of loop indices on which to project. Unlike with data movement and floating point operations, the resulting count is the number of synchronizations encountered by a single work-item.

Aside from enabling automatic computation of kernel features used in Perflex models, this operation-counting procedure is an independently useful tool for code analysis and algorithm development.

Algorithm 2: Determine the accessed index footprint F_v ⊂ Z^{d'} for a variable v, where v is a d'-dimensional array.
  for each statement S in the kernel do
    Compute the projection π_S(D) of the loop domain D onto the set of loop indices in which S resides.
    for each access j to v in statement S do
      Determine the multi-dimensional index mapping I_{S,j} : Z^d → N^{d'} that takes a tuple of loop variables to the accessed indices. For example, the access a[2*k+1, l+1] would have an index mapping of I_{S,j}(k, l) = (2k+1, l+1).
    end for
  end for
  Find the overall accessed footprint as
      F_v = ∪_{statements S, accesses j} I_{S,j}(π_S(D)).

We model execution time, or more generally any feature, as a function of user-defined parameters and other kernel features, i.e.,

    T_wall(n) = feat_out(n) ≈ g(feat_in,0(n), ..., feat_in,j(n), p_0, ..., p_k),

where n is a vector of integer parameters used in the loop domain that remain constant throughout the computation, feat_in,i(n) accounts for the number of units of a particular characteristic of a kernel (e.g., the number of single-precision 32-bit floating point multiplications), p_i is a machine-dependent calibration parameter related to hardware behavior, and the model expression g is a function provided by the user that is differentiable with respect to the parameters. When creating a Perflex model, the user provides an output feature feat_out(n) and a model expression.

A kernel feature is a function that accepts a kernel and a set of domain parameters n and returns a (real) number. An input feature is a feature that appears in a model expression, like f_madd(n) in the example model described by (1), and an output feature is a feature produced when evaluating a model, such as execution time.

Perflex uses the operation counting approach described in Section 5 to compute features, most of which are fully parametric in the sense that once a symbolic representation like f_madd(n) has been determined from a kernel expressed in Loopy's internal representation, it can be cheaply reevaluated for changed values of the domain parameter vector n.
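For instance, the parametric count underlying f_madd(n) can be reproduced directly with islpy, the isl bindings that Loopy builds on. This is a sketch only; the parametric card() call assumes an islpy build with barvinok support:

import islpy as isl

# One integer point per execution instance of the innermost statement.
dom = isl.Set("[n] -> { [i,j,k] : 0 <= i,j,k < n }")

# Parametric count: a piecewise quasi-polynomial in n.
print(dom.card())  # e.g., [n] -> { n^3 : n > 0 }

# Non-parametric sanity check at a small fixed size.
fixed = dom.fix_val(isl.dim_type.param, 0,
                    isl.Val.int_from_si(dom.get_ctx(), 8))
print(fixed.count_val())  # 512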
We cache these symbolic representations for quick reuse, and Perflex distinguishes between situations where a cached symbolic expression can be immediately re-evaluated using a changed n, and situations where additional processing using the new problem size parameters is required. For example, a feature may have a characteristic specified using inequality constraints involving a size parameter, e.g., a memory access whose lstrides specification involves an inequality in n. The symbolic expression for the number of accesses with this characteristic may change if n changes, so a previously computed expression cannot be directly reused.

When creating a model in Perflex, the user specifies an output feature and a model equation containing input features. Each feature is denoted by an identifier beginning with the prefix f_, as shown in Section 2.2. The first section of the string determines the feature class, and the remainder determines specific characteristics of that feature. For example, we might expand our matmul model to incorporate two types of memory access costs as follows:

model = Model(
    "f_cl_wall_time_nvidia_geforce",
    "p_f32madd * f_op_float32_madd + "
    "p_f32l * f_mem_access_local_float32 + "
    "p_f32g * f_mem_access_global_float32")

This model now contains one operation feature and two memory access features. We provide a built-in set of features, discussed in the following sections, which we have found sufficient to accurately model execution time in a variety of computations. Perflex users can also create their own custom features.

For most types of computational kernels, data motion is the dominant cost. We account for this with a family of memory access features, each member of which has a set of characteristics affecting its cost. We refer to these characteristics collectively as a memory access pattern; they include

• the memory type, e.g., local or global,
• the direction, e.g., load or store,
• the size of the data type accessed, e.g., 32-bit or 64-bit,
• the local and global strides along each thread axis in the array index, i.e., the strides gs0, gs1, ..., ls0, ls1, ... in the flattened array index array[gs0*gid(0) + gs1*gid(1) + ... + ls0*lid(0) + ls1*lid(1) + ...], with units equal to the size of the data type (recall that we assume these indices are affine),
• and the ratio of memory accesses to accessed data footprint (access-to-footprint ratio, or AFR). I.e., a ratio of 1 means every element in the footprint is accessed one time, and a ratio greater than 1 means that some elements are accessed more than once.

Model Fidelity for Data Motion Features: Changes to any aspect of a memory access pattern may affect cost, particularly for global memory access, and for this reason, we create a unique kernel feature for every different global memory access pattern found in the kernels whose execution time we model in Section 8. As an extreme example of the effect of access pattern modification on cost, consider the difference between the patterns for matrices a and b in the example in Section 2.1:

for (int k_out = 0; k_out <= ((-16 + n) / 16); ++k_out)
  ...
  a_fetch[...] = a[n*(16*gid(1) + lid(1)) + 16*k_out + lid(0)];
  b_fetch[...] = b[n*(16*k_out + lid(1)) + 16*gid(0) + lid(0)];

Table 1. Global load patterns in tiled matmul with prefetching.

  Array | Ratio | Local strides | Global strides | Loop stride
  a     | n/16  | {0:1, 1:n}    | {0:0, 1:n*16}  | 16
  b     | n/16  | {0:1, 1:n}    | {0:16, 1:0}    | n*16
The local strides and AFRs are the same for both arrays, and the loop variable strides and global strides are different, as shown in Table 1.

Despite these similarities, we observe that the cost can differ significantly between these two patterns. To compare the costs of these two reads from memory, we create two microbenchmark kernels, each of which reads a global array using an access pattern matching the a or b fetch pattern. We choose array sizes large enough that overhead and other costs not associated with the read are negligible. When we run them on the Nvidia GTX Titan X GPU, varying the matrix width n from 2048 to 3584, we observe an average cost per load for the b-pattern kernel that is consistently 4-5 times higher than that of the a-pattern kernel. (Further platform information is available in Table 2.)

This observation provides evidence that aspects of GPU execution not discoverable to a black-box model with justifiable effort can strongly influence the cost of global memory access. Observe that these two accesses only differ in their global (i.e., work-group-to-work-group) stride and the stride of the sequential loop variable k. Analytical models would have to account for many undocumented machine details (e.g., work-group scheduling) to account for these differences.

Since it is difficult to rule out the possibility that any change in a global memory access pattern will affect execution cost on some hardware, we create a unique kernel feature for every different global memory access pattern found in the kernels whose execution time we model in Section 8. With this approach, a universal model for all kernels on all hardware based on kernel-level features like ours would need a prohibitively large number of global memory access features and corresponding measurement kernels. This motivates our decision to allow proxies of 'in-situ' memory accesses to be included as features, which in turn motivates our work removal code transformation, discussed in Section 7.1.1. This tool facilitates the generation of microbenchmarks exercising memory accesses matching access patterns found in specific computations by stripping away unrelated portions of the computation in an automated fashion.

Specifying Data Motion Features in the Model: To facilitate the creation of these individualized memory access features, we provide the option to identify data motion features using a unique memory access tag to match a particular memory access found in a kernel by name. To use this approach, we tag the global loads as aLD and bLD when creating our Loopy matmul kernel:

knl = lp.make_kernel(
    "{[i,j,k]: 0<=i,j,k<n}",
    "c[i,j] = sum(k, a$aLD[i,k] * b$bLD[k,j])")

As we will discuss in Section 7.1.1, in addition to identifying memory access features in the model, these tags allow our work removal code transformation tool to selectively remove computations from a kernel in order to create microbenchmarks exercising specific memory access patterns.

As a less target-kernel-specific option, we also offer the possibility to characterize memory accesses for the creation of memory access features by specifying properties of the access pattern. In this case, each access matching the property criteria is included when computing the feature.
Using this approach, we could incorporate these matrix multiplication memory accesses into our model as follows:

model = Model(
    "f_cl_wall_time_nvidia_geforce",
    "p_f32madd * f_op_float32_madd + "
    "p_f32l * f_mem_access_local_float32 + "
    "p_f32ga * f_mem_access_global_float32_load_lstrides:{0:1;1:>15}_gstrides:{0:0}_afr:>1 + "
    "p_f32gb * f_mem_access_global_float32_load_lstrides:{0:1;1:>15}_gstrides:{0:16}_afr:>1 + "
    "p_f32gc * f_mem_access_global_float32_store")

All fields after the f_mem_access prefix are optional, and, in the current implementation, they must be provided in the following order: the memory access tag, the memory type, the data type, the direction, the local strides, the global strides, and the access-to-footprint ratio.

While execution time for many computations is dominated by data movement, arithmetic operations also contribute, sometimes significantly, to overall execution time. We account for these costs with a family of features that count arithmetic operations. Each operation is characterized by the operation type, e.g., addition, multiplication, or exponentiation, and the data type, e.g., float32 or float64. A 32-bit floating point multiplication operation feature, for example, could be specified in a model string as f_op_float32_mul. The models we demonstrate in this work do not include integer arithmetic features; in the kernel variants modeled, integer arithmetic is only used in array index computation, a cost that can be reduced to negligible levels by, e.g., the compiler performing common subexpression elimination.

Local barriers in GPU kernels halt execution of every thread within a work-group until all threads have reached the barrier, and can contribute to execution time. Additionally, launching a kernel incurs a constant overhead cost. We account for these costs with a family of synchronization features. Synchronization types include local barriers and kernel launches. Recall that the statistics gathering module counts the number of synchronizations encountered by a single work-item, so depending on how a user intends to model execution, they may need to multiply a synchronization feature like local barriers by, e.g., the number of work-groups, a feature discussed in the next section. A user might incorporate synchronization features into this model as follows:

model = Model(
    "f_cl_wall_time_nvidia_geforce",
    "p_f32madd * f_op_float32_madd + "
    ...
    "p_barrier * f_sync_barrier_local * f_thread_groups + "
    "p_launch * f_sync_kernel_launch")

We provide a few other built-in kernel features in Perflex. By executing OpenCL kernels containing no instructions and varying the number of work-groups launched, we learned that average execution time increases with the work-group count. This was true on all five GPUs we tested, listed in Table 2. We allow Perflex models to account for this cost by providing a thread groups feature, the total work-group count. We also provide an OpenCL wall time feature, which accepts a platform, e.g., nvidia, and a device, e.g., geforce, and, when evaluated, executes 60 trials of the kernel on the specified device to obtain an average wall time. This feature is typically chosen as the output in our model expressions.
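Such a wall time feature can be realized with OpenCL profiling events, which time only on-device execution. A simplified pyopencl sketch, illustrating the idea rather than Perflex's actual implementation:

import numpy as np
import pyopencl as cl

ctx = cl.create_some_context()
queue = cl.CommandQueue(
    ctx, properties=cl.command_queue_properties.PROFILING_ENABLE)

def average_wall_time(enqueue_kernel, trials=60):
    # enqueue_kernel(queue) is assumed to enqueue one kernel run and
    # return the associated OpenCL event.
    times = []
    for _ in range(trials):
        evt = enqueue_kernel(queue)
        evt.wait()
        # Profiling timestamps are reported in nanoseconds.
        times.append((evt.profile.end - evt.profile.start) * 1e-9)
    return np.mean(times)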
We measure kernel execution time excluding any host-device transfer of data.

Based on the feature examples provided in the preceding sections, a complete model might be expressed as follows:

model = Model(
    "f_cl_wall_time_nvidia_geforce",
    "p_f32madd * f_op_float32_madd + "
    "p_f32l * f_mem_access_local_float32 + "
    "p_f32ga * f_mem_access_global_float32_load_lstrides:{0:1;1:>15}_gstrides:{0:0}_afr:>1 + "
    "p_f32gb * f_mem_access_global_float32_load_lstrides:{0:1;1:>15}_gstrides:{0:16}_afr:>1 + "
    "p_f32gc * f_mem_access_global_float32_store + "
    "p_barrier * f_sync_barrier_local * f_thread_groups + "
    "p_group * f_thread_groups + "
    "p_launch * f_sync_kernel_launch")

Feature values, as discussed in the previous sections, are kernel-dependent, whereas parameter values are hardware-dependent. For example, the parameter p_madd in (1), passed as p_f32madd to the example Perflex model in Section 2.2, is the coefficient of the madd count in the model, representing the effective cost per madd. Identifiers of parameters in model expressions begin with the prefix p_, followed by a unique user-defined string of characters used to distinguish the parameter from others. We determine parameter values using the calibration process described in Section 7.

To avoid the need for machine-specific architectural and performance knowledge, and to promote model portability and customizability, we treat the GPU as a black box and determine values for model parameters by executing a set of measurement kernels. We provide a collection of them in a software package called UIPiCK, which enables this customizable microbenchmarking functionality.

UIPiCK includes a collection of over 20 kernel creation functions, each capable of producing multiple implementation variants of a particular Loopy kernel. These include computations designed to exercise a particular feature, e.g., single-precision floating point multiplication or a particular memory access pattern, as well as more complex application-oriented computations, e.g., multiplying two matrices or performing matrix transposition. Arguments passed to a creation function determine which variant of a particular computation is produced. Arguments may include, for example, thread group dimensions, array dimensions, the data type, or whether to perform a particular Loopy transformation like prefetching or loop unrolling. Our transformation engine, Loopy, discussed briefly and demonstrated in Section 2, enables this automated creation and transformation of kernels.

While a user can use kernel creation functions directly to produce Loopy kernels, UIPiCK also provides a tag-driven filtering interface to facilitate production of large sets of measurement kernels matching specified characteristics. To do this, we provide a collection of kernel generators, each of which corresponds to a kernel creation function. Each generator maintains a collection of filtering tags of two varieties.

Generator filter tags determine which generators are used and consist of a single value that identifies a characteristic of the computation, e.g., matmul_sq or flops_mul_pattern. The generators matching the user-provided generator filter tags, according to a user-provided generator match condition, will execute. The match condition defines the condition under which UIPiCK will consider a particular generator a match for the set of generator filter tags.
This condition may require that, for a generator to execute, (1) the generator's filter tag set must be identical to the user-provided tags, (2) the generator's filter tag set must be a subset of the user-provided tags, (3) the generator's filter tag set must be a superset of the user-provided tags (the default), or (4) the intersection of the generator's filter tag set and the user-provided tags must be non-empty. An example below briefly illustrates how different generator match conditions may yield different sets of matching generators.

Variant filter tags consist of argument:value pairs, e.g., dtype:float32, and determine specific characteristics of the kernel variants to be generated. For each argument, a generator maintains criteria defining a set of allowable values. For example, the prefetch argument might allow the set {True, False}, and the dtype argument might allow the set {float32, float64}. When executed, a generator generates one kernel for each set of arguments in the Cartesian product of allowable argument value sets. By passing a variant filter tag with argument values, a user reduces the set of allowable values for that argument.

For example, recall the filter tags used in the example in Section 2.2:

filter_tags = [
    "matmul_sq", "dtype:float32", "prefetch:True",
    "lsize_0:16", "lsize_1:16", "groups_fit:True",
    "n:2048,2560,3072,3584"]
m_knls = KernelCollection(
    uipick.ALL_GENERATORS).generate_kernels(filter_tags)

In this case, we do not provide a generator match condition, so the default, condition (3), will be used; in order to execute, a generator's filter tag set must be a superset of the user-provided generator tags. We provide one generator filter tag, matmul_sq, and only one generator in UIPiCK contains this tag, so only this generator meets the match condition. If we had provided a second generator filter tag, e.g., finite_diff, then no generators would meet the match condition, since no generator's tag set contains both matmul_sq and finite_diff. If we wanted both of these generators to execute using such a tag set, we could instead use match condition (4): the intersection of a generator's filter tag set and the user-provided tags must be non-empty. We would specify this condition by passing generator_match_cond=MatchCondition.INTERSECT as an argument to KernelCollection.

Only one generator matches our single generator tag, so only one generator will execute. We also provide a single value for each variant filter tag associated with that generator, except the array size n, for which we provide a set of four values. The Cartesian product of these sets contains four sets of argument values. These four sets each contain a different value for n and are otherwise identical, so all four kernels produced will perform the same square matmul with a different value of n. These variants will all operate on 32-bit floating point data, perform a prefetching operation that fetches 16 × 16 tiles from global memory, and assume that work-group dimensions fit evenly into array dimensions, which avoids the need for conditionals. If we were to omit, for example, the tag prefetch:True, we would instead obtain 8 kernels; for each problem size, UIPiCK would generate one kernel that performs prefetching and one that does not. We could include more generators by adding additional generator filter tags and passing the desired generator match condition to our kernel collection.
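The four match conditions correspond directly to set comparisons. The following sketch shows the matching logic; it is illustrative, not UIPiCK's actual code:

from enum import Enum, auto

class MatchCondition(Enum):
    EXACT = auto()      # (1) identical tag sets
    SUBSET = auto()     # (2) generator tags within the user tags
    SUPERSET = auto()   # (3) generator tags contain the user tags (default)
    INTERSECT = auto()  # (4) any tag in common

def generator_matches(generator_tags, user_tags,
                      cond=MatchCondition.SUPERSET):
    g, u = set(generator_tags), set(user_tags)
    if cond is MatchCondition.EXACT:
        return g == u
    if cond is MatchCondition.SUBSET:
        return g <= u
    if cond is MatchCondition.SUPERSET:
        return g >= u
    return bool(g & u)

# Under the default SUPERSET condition, a generator whose tag set
# contains matmul_sq (the second tag here is hypothetical) matches
# the user-provided ["matmul_sq"]:
assert generator_matches({"matmul_sq", "tiled"}, ["matmul_sq"])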
To enable kernel-specific data motion features (described in Section 6.1.1), including the creation of microbenchmarks exercising these features, and potentially other fine-grained study of the contributions of various operation types to computational cost, we introduce a code transformation that can extract a set of desired operations from a given computation, while maintaining the overall loop structure and sufficient data flow to avoid elimination of further parts of the computation by optimizing compilers. We call this facility the 'work remover'.

This code transformation strips away arithmetic operations and local memory access, i.e., on-chip work, from a kernel. It leaves behind all global memory accesses or a subset thereof, as specified by the user, helping a developer target and analyze specific portions of an application. For example, it can help reveal the extent to which on-chip work and global memory access affect execution time and inform the decision of whether to use a model that allows for overlap of on-chip and global memory operations, as we will discuss in Section 8.1.

Additionally, this tool can aid in producing a microbenchmark matching a particular memory access pattern. In Section 6.1.1, we discussed the motivation for our decision to create a unique kernel feature for every different global memory access pattern. As discussed in Section 7.1.2, we provide a generator that automatically constructs measurement kernels exercising access patterns meeting certain criteria (patterns with AFR equal to one that are simple enough to be fully specified by the local strides, global strides, and data size, i.e., patterns that do not produce a write race and are not nested inside sequential loops). For more complex patterns, typically arising from a variant of a computation we wish to model, we use generators employing a subtractive, rather than additive, approach to microbenchmark creation. Using Algorithm 3, these generators remove statements from the target variant, leaving behind a measurement kernel exercising the desired memory access pattern.

Algorithm 3: Remove arithmetic and local memory operations.
  Require: a set G_rm of global memory accesses to remove.
  Insert a statement initializing a local variable tgt_read = 0.
  for each statement S in the kernel do
    Let G_ld(S) be the set of global memory loads in S.
    Let G_newld(S) = G_ld(S) − (G_ld(S) ∩ G_rm).
    for each memory access g_newld in G_newld(S) do
      Insert a statement tgt_read = tgt_read + g_newld (keeping the dependencies of S and adding a dependency on the tgt_read initialization).
    end for
    if S contains a global store g_st ∉ G_rm then
      Insert a statement g_st = tgt_read (keeping the dependencies of S and adding a dependency on the tgt_read initialization).
    end if
    Remove statement S.
  end for
  if tgt_read is never written to global memory then
    Create a new global variable tgt_destread with one entry per work-item.
    Insert a statement tgt_destread = tgt_read after all statements that modify tgt_read, writing tgt_read to the local work-item's entry in tgt_destread.
  end if
  Infer the types of tgt_read and tgt_destread based on G_newld(S).

For example, consider our tiled matrix multiplication Loopy kernel that produced the following OpenCL code:

float acc;
acc = 0.0f;
__local float a_fetch[16*16];
__local float b_fetch[16*16];
for (int k_out = 0; k_out <= ((-16 + n) / 16); ++k_out)
{
  barrier(CLK_LOCAL_MEM_FENCE);
  a_fetch[16*lid(1) + lid(0)] =
    a[n*(16*gid(1) + lid(1)) + 16*k_out + lid(0)];
  b_fetch[16*lid(1) + lid(0)] =
    b[n*(16*k_out + lid(1)) + 16*gid(0) + lid(0)];
  barrier(CLK_LOCAL_MEM_FENCE);
  for (int k_in = 0; k_in <= 15; ++k_in)
    acc = acc + a_fetch[16*lid(1) + k_in] *
                b_fetch[16*k_in + lid(0)];
}
c[n*(16*gid(1) + lid(1)) + 16*gid(0) + lid(0)] = acc;

To isolate the global load from b, we can use the work remover as follows:
To isolate the global load from b, we can use the work remover as follows:

    knl = remove_work(knl, remove_vars=["a", "c"])

Following Algorithm 3, this function removes the local memory transactions, as well as the global memory accesses to variables a and c, producing the following OpenCL kernel:

    float read_tgt;
    read_tgt = 0.0f;
    for (int k_out = 0; k_out <= ((-16 + n) / 16); ++k_out)
      read_tgt = read_tgt +
        b[n*(16*k_out + lid(1)) + 16*gid(0) + lid(0)];
    read_tgt_dest[16*n*gid(1) + n*lid(1) + 16*gid(0) + lid(0)] = read_tgt;

Observe that the access pattern to b is unchanged. We include the seemingly unnecessary global store to read_tgt_dest to ensure that statements with unused results are not dropped by an optimizing compiler. To maximize model accuracy, this store will also need to be represented by a feature in the model; it uses a straightforward-to-model, stride-1 access pattern. We also include writes to global memory matching this pattern in our measurement kernels for arithmetic operations and local memory access (discussed in the next section) for the same reason.

The code resulting from this transformation includes a tight dependency chain on the increment of read_tgt. These dependencies have the potential to impact the measurements. We leave further investigation of this matter to future work.

To allow developers to create custom sets of microbenchmarks on which to calibrate their models, UIPiCK allows users to choose from generators that we provide, and developers may also create custom measurement kernel generators to fit a specific purpose. In support of this work, we have created a set of microbenchmark measurement kernels, each designed to measure the cost of a single operation (as represented by a feature, see Section 6.1). These kernels make up the measurement kernel sets used to calibrate the models demonstrated in Section 8, where we describe a rigorous evaluation procedure to characterize the achievable modeling fidelity of the proposed system. One important technique in designing these kernels is to, e.g., vary the quantity of a single feature while keeping other feature counts constant. Descriptions of a few fundamental examples of these microbenchmarking kernels follow.

Global memory access: We employ two varieties of measurement kernels designed to exercise global memory access. First, for access patterns with AFR equal to one that are simple enough to be fully specified by the local strides, global strides, and data size, i.e., patterns that do not produce a write race and are not nested inside sequential loops, we provide a generator that automatically constructs measurement kernels exercising the desired pattern, which is specified via the user-provided variant filter tags. In these kernels, each work-item performs a global load from each of a variable number of input arrays using the specified access pattern. Each work-item then stores the sum of the input array values it fetched in a single result array. For example, with two load arrays, this kernel would perform the operation result[pattern] = in0[pattern] + in1[pattern].
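As a sketch of how such a kernel might be requested, the tags below mirror the Section 2.2 example; the generator tag gmem_simple and the spellings of the stride and array-count tags are hypothetical, chosen only to illustrate the shape of the request.

    # Hypothetical tags (names assumed): request a two-input global-load
    # measurement kernel on float32 data with 16x16 work-groups.
    filter_tags = [
        "gmem_simple",               # assumed generator filter tag
        "dtype:float32",
        "lsize_0:16", "lsize_1:16",
        "n_input_arrays:2",
        "lid_0_stride:1",            # assumed spelling of the stride tags
        "gid_0_stride:256",
    ]
    knls = KernelCollection(uipick.ALL_GENERATORS).generate_kernels(filter_tags)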
Variant filter tags provided for these kernels specify the data type, global memory array size, work-group dimensions, number of input arrays, and thread index strides.

Second, to generate measurement kernels exercising more complex access patterns, e.g., memory accesses inside nested loops or with AFRs not equal to one, we provide generators that use the work removal approach described in Section 7.1.1 to create microbenchmark kernels exactly matching these more complex patterns. These generators first construct the original application kernel that contains the desired memory access pattern, and then, using the process described in Section 7.1.1, strip away operations until the desired memory access remains. These generators tag memory accesses in these kernels using Loopy memory access tags, and we use these memory access tags to identify the desired access pattern in the corresponding Perflex model feature, as shown in Section 6.1.1. Variant filter tags provided to these kernel generators include all filter tags used when specifying the original application kernel variant, as well as filter tags specifying whether to remove on-chip work and filter tags specifying which, if any, global memory accesses to remove.

Arithmetic operations: In measurement kernels designed to exercise an arithmetic operation, we first have each work-item initialize 32 private variables of the specified data type. We then have it perform a loop in which each iteration updates each variable using the target arithmetic operation on values from other variables. We unroll the loop by a factor of 64 and arrange the variable assignment order to achieve high throughput using the approach found in the Scalable HeterOgeneous Computing (SHOC) OpenCL MaxFlops.cpp benchmark (Danalis et al. 2010). In this approach, the 32 variable updates are ordered so that no assignment depends on the most recent four statements. In our experience, using 32 variables permits peak performance to be attained by avoiding tight dependency chains without losing performance due to underutilization of scheduler slots caused by register space consumption.

After the loop completes, we sum the 32 variable values and store the result in a global array according to a user-specified memory access pattern. As discussed in Section 7.1.1, we include these global stores to ensure that statements with unused results are not dropped by an optimizing compiler. These generators accept global access patterns of the simple variety described above, for which we can construct kernels in an automated fashion. Variant filter tags provided for these kernels include the data type, global memory array size, work-group dimensions, thread index strides for the global memory access pattern, and number of loop iterations.

Local memory access: In measurement kernels designed to exercise access to OpenCL local memory, i.e., access to the per-core shared scratchpad, each work-item initializes one element of a local array to the specified data type. We then have it perform a loop, at each iteration moving a different element from one location in the array to another. We avoid write races and simultaneous reads from a single memory location, and use an lid(0) stride of 1, avoiding bank conflicts. After the loop completes, each work-item writes one value from the shared array to global memory. While our framework allows local memory access features to be characterized by thread index strides, we do not use these strides to differentiate local memory accesses in the demonstrations presented here.
Instead, we use a single feature for all 32-bit local memory accesses occurring in measurement kernels and modeled kernels. Variant filter tags provided for this kernel include the data type, global memory array size, iteration count, and work-group dimensions, which determine the local strides for local memory access as well as the size of the local array.

Other features: When creating the measurement kernel sets used to calibrate the models demonstrated in Section 8, we also generate variants of a measurement kernel that executes a variable number of local barriers, a measurement kernel similar to that described in Section 7.4 that reveals operation overlapping behavior, and a measurement kernel that launches a specified number of work-groups performing no operations. We set problem sizes to attain execution times between 1 and 1000 milliseconds, with the exception of the empty kernel generator, which produces some kernels launching as few as 16 work-groups in order to reveal the kernel launch overhead. Using a sufficiently high-fidelity model, we expect that users will be able to differentiate between latency-based costs of a single kernel launch and throughput-related costs that would be incurred in pipelined launches.

After creating a model and generating a measurement kernel set, we collect feature values for each measurement kernel and then fit the model function to the data by minimizing the Euclidean norm of the residual in the nonlinear least squares problem

    min_p || g(p) − t ||,

where the residual is defined as r(p) = t − g(p) with

    r_k(p) = t_k − g_k(p)    (k ∈ {1, . . . , l}),
    p = (p_1, . . . , p_m)^T, g = (g_1, . . . , g_l)^T, t = (t_1, . . . , t_l)^T.

Here, l is the number of measurement kernels, m is the number of model parameters, p_i is the i-th model parameter, g_k is the model function containing feature values for the k-th measurement kernel, and t_k is the output feature value for the k-th measurement kernel. Thus, the resulting nonlinear system contains one row for each measurement kernel. Solving this problem involves evaluating the Jacobian

    J(p): j_ki = ∂g_k/∂p_i (p).

After using symbolic differentiation to obtain the Jacobian, we provide the Jacobian and residual to Scipy's optimize.leastsq function to solve the nonlinear system.

If the user is concerned about prediction error relative to execution time, rather than absolute prediction error, they may call scale_features_by_output() on the feature data before calibrating the model, which divides each input feature value by the corresponding output feature value and sets output feature values to 1. We perform this scaling in all examples discussed in this work.

After calibrating the model, Perflex logs the least squares residual evaluated at the solution. This value can be examined by a user and may aid in assessing their model expression; if calibrating a particular model to a particular workload results in a high residual, this may indicate that the model is not appropriate for the workload. We discuss other strategies to determine the appropriateness of a model in Section 8.1.
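As a rough sketch of this calibration step (not the actual Perflex implementation), the following fits a two-parameter linear model to synthetic measurement data with scipy.optimize.leastsq, supplying an analytical Jacobian as Perflex does via symbolic differentiation. The feature names and values are invented for illustration.

    import numpy as np
    from scipy.optimize import leastsq

    # Synthetic "measurement kernels": feature counts and observed times.
    # Model: g_k(p) = p0 * flops_k + p1 * bytes_k (a simple linear example).
    flops = np.array([1e9, 2e9, 4e9, 8e9])
    bytes_ = np.array([4e8, 4e8, 8e8, 1.6e9])
    t = 5e-12 * flops + 1e-11 * bytes_       # pretend these were measured

    def residual(p):
        return t - (p[0] * flops + p[1] * bytes_)    # r(p) = t - g(p)

    def jacobian(p):
        # d r_k / d p_i = -d g_k / d p_i
        return -np.column_stack([flops, bytes_])

    p0 = np.array([1e-12, 1e-12])
    p_fit, ier = leastsq(residual, p0, Dfun=jacobian)
    print(p_fit)    # approximately [5e-12, 1e-11]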
After obtaining model parameter values using the calibration process just described, we can compute a predicted output feature (e.g., execution time) for a Loopy kernel. This requires a dictionary of kernel arguments, i.e., problem size variable values, to compute the kernel feature values, which are used along with the parameter values to evaluate the model, as demonstrated in the example in Section 2.2:

    model_param_values = fit_model(model, m_knl_feature_values)
    exec_time = model.eval_with_kernel(
        model_param_values, test_knl, {"n": 1024})

Model evaluation cost is primarily driven by the evaluation of piecewise quasi-polynomials as well as the model expression.

As a GPU schedules subgroup execution, data movement between global memory and the GPU die may overlap with on-chip operations like arithmetic. We demonstrate how this nonlinear relationship between cost components and overall cost can be accommodated in a model expression in the proposed system.

If on-chip operations and global memory transactions overlap completely, execution time may be approximated as

    t ≈ max(c_gmem, c_on-chip),    (3)

where c_gmem and c_on-chip are the time costs of global memory transactions and on-chip operations, respectively. Since Perflex models must be differentiable, we cannot use this approach directly. Instead, we use a differentiable function ŝ(x) approximating the step function

    s(x) = { 0, x < 0
           { 1, x ≥ 0    (4)

to approximate the model described by (3) as

    t ≈ c_gmem · ŝ(c_gmem − c_on-chip) + c_on-chip · ŝ(c_on-chip − c_gmem).    (5)

Here, ŝ(x) serves as a switch, "turning on or off" the global memory and on-chip cost components, depending on which is greater. If c_gmem > c_on-chip, then ŝ(c_gmem − c_on-chip) ≈ 1 and ŝ(c_on-chip − c_gmem) ≈ 0, so t ≈ c_gmem. Alternatively, if on-chip costs dominate, t ≈ c_on-chip.

To approximate the step function, we use

    ŝ(x) = (tanh(p_edge · x) + 1)/2,    (6)

where p_edge is a parameter determining the "abruptness" of the step. It is determined along with the other parameters during model calibration. As p_edge increases, ŝ(x) becomes more similar to s(x). Variations of (6) with additional parameters could model partial overlap between global memory transactions and on-chip operations as well. Figure 4 shows a comparison between s(x) and ŝ(x) with p_edge = 10.

Figure 4. Approximating s(x) with the differentiable ŝ(x), shown for p_edge = 10.
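A small numerical sketch of (5) and (6) follows; this is our own illustration, whereas Perflex builds these expressions symbolically.

    import numpy as np

    def s_hat(x, p_edge=10.0):
        """Differentiable step approximation, Eq. (6)."""
        return (np.tanh(p_edge * x) + 1.0) / 2.0

    def t_overlap(c_gmem, c_onchip, p_edge=10.0):
        """Overlap model, Eq. (5): approximates max(c_gmem, c_onchip)."""
        return (c_gmem * s_hat(c_gmem - c_onchip, p_edge)
                + c_onchip * s_hat(c_onchip - c_gmem, p_edge))

    # For well-separated costs the model reproduces the max; near
    # c_gmem == c_onchip it rounds the corner smoothly instead of kinking.
    print(t_overlap(3.0, 1.0))   # ~3.0
    print(t_overlap(1.0, 1.0))   # 1.0 (both switches evaluate to 1/2)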
To demonstrate the effectiveness of this technique, we create a measurement kernel in which we can vary the ratio of local to global memory accesses. In this kernel, each thread performs one 32-bit global load, followed by m 32-bit local memory accesses. By varying m, the ratio of local to global memory accesses, we control whether the execution time is dominated by global or local memory transactions.

When m is small, on-chip costs may be hidden behind global memory transactions; as m increases, eventually local memory transactions dominate execution time. We model this using a Perflex model based on (5), with c_gmem and c_on-chip being expressions containing Perflex features and parameters representing the global and local memory access costs. Figure 5 displays how Perflex calibrates such a model based on this data. We observe that the extent to which local memory transaction costs in this kernel are hidden behind global transaction costs varies significantly across machines. On the Nvidia Tesla K40c and Nvidia Tesla C2070 GPUs, very little, if any, of the local access cost is hidden, while on the Nvidia Titan V, Nvidia GTX Titan X, and AMD Radeon R9 Fury GPUs, the cost of anywhere from 4 to 12 local memory accesses can be hidden behind a global transaction. We conclude that, at least for this kernel, on-chip and global memory operation overlap behavior varies across GPUs, and that a Perflex model based on (5) can model this behavior. We will discuss results for another kernel variant where this overlap behavior varies across GPUs in Section 8.4. OpenCL version and other platform information is available in Table 2.

Figure 5. Modeling overlap of local and global memory transactions; measured and predicted execution times shown. Geometric mean of relative error: Volta Titan V 1.9%, GTX Titan X 1.5%, Tesla K40c 1.1%, Tesla C2070 2.8%, Radeon R9 Fury 1.7%. Array size differs across GPUs.

Table 2. Platforms used for evaluation in Sections 7.4 and 8.

    GPU (Generation)                OpenCL/Platform/Driver Info
    Nvidia Titan V (Volta)          OCL 1.2, CUDA 10.0.246 (410.93)
    Nvidia GTX Titan X (Maxwell)    OCL 1.2, CUDA 10.0.292 (410.104)
    Nvidia Tesla K40c (Kepler)      OCL 1.2, CUDA 9.1.84 (390.87)
    Nvidia Tesla C2070 (Fermi)      OCL 1.2, CUDA 9.1.84 (390.116)
    AMD Radeon R9 Fury (GCN 3)      OpenCL/ROCm 1.2.0-2019020110,
                                    ROCm platform HSA runtime 1.1.9-49-g39f1af5,
                                    Kernel 4.19

To demonstrate our approach, we create and calibrate models for three applications, with each model designed to predict execution times for multiple variants of a particular computation. These computations include a matrix-matrix multiplication (two variants), a Discontinuous Galerkin (DG) differentiation operation (four variants), and a 2-D five-point finite difference stencil computation (two variants). Using UIPiCK, we produce a cost-analytic measurement kernel set for each model. This process decomposes the computational cost incurred into individual components calibrated by microbenchmarks, which combine to model, and thereby explain, the computational cost of the kernel under investigation. By conducting measurements of our analytical microbenchmarks on the five GPUs in Table 2, we obtain calibrated models on each platform, which we then evaluate for predictiveness and accuracy.

To obtain average execution times, we evaluate our OpenCL wall time feature, discussed in Section 6.1.4. On the AMD Radeon R9 Fury GPU, we observed that anomalous execution times on the order of 10× higher than a variant's usual execution time can occur occasionally, seemingly at random, and we exclude these events from our data.

We compare execution times predicted by the models with measured execution times in Figures 7, 8, and 9, and report the geometric mean of relative error for reasons laid out by Fleming and Wallace (1986).
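For reference, a minimal computation of this error metric (our own illustration; the paper does not show this code):

    import numpy as np

    def geomean_relative_error(predicted, measured):
        """Geometric mean of |predicted - measured| / measured."""
        predicted, measured = np.asarray(predicted), np.asarray(measured)
        rel = np.abs(predicted - measured) / measured
        return np.exp(np.mean(np.log(rel)))

    print(geomean_relative_error([1.1, 2.0, 4.4], [1.0, 2.1, 4.0]))  # ~0.078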
In the models we consider in this evaluation, we categorize workload costs as

• c_gmem: global memory access,
• c_on-chip: on-chip work, i.e., local/scratchpad memory access and arithmetic, and
• c_overhead: barrier, kernel launch, and work-group launch costs.

We model each of these three cost components individually as a sum of kernel features weighted by cost parameters, with barrier cost modeled on a per-work-group basis via the strategy described in Section 6.1.3. We evaluate two different types of models, a linear model,

    t ≈ c_overhead + c_gmem + c_on-chip,    (7)

and a nonlinear model that allows overlap of on-chip and global memory operation costs,

    t ≈ c_overhead + c_gmem · ŝ(c_gmem − c_on-chip) + c_on-chip · ŝ(c_on-chip − c_gmem).    (8)

In general, the extent to which on-chip operation costs are hidden by global memory transactions varies between kernels and across architectures. To determine whether the extent of this overlap warrants the nonlinear model expressed in (8), we apply multiple strategies.

First, before building and calibrating a performance model, we use the work removal routine discussed in Section 7.1.1 to remove arithmetic and local memory accesses from a kernel, obtaining execution times for a version of the kernel containing only global memory traffic. We then estimate the cost of the removed on-chip operations using the costs revealed by our microbenchmark kernels. If the sum of these two separate costs is approximately equal to the total execution time of the original kernel, this suggests little to no overlap. However, if the sum of these separate costs is significantly greater than the total execution time, this serves as evidence that on-chip costs are being hidden.

To gain further confidence in the presence or absence of this overlap, we can build and calibrate both kinds of execution time models and observe the results. When we use the linear model to predict execution times for a kernel exhibiting overlap of on-chip costs that are significant relative to the total kernel execution time, we observe inflated predictions of execution time, sometimes by a significant factor, as will be discussed in Section 8.3. We observe the opposite result when applying the nonlinear model to a kernel where the cost of on-chip work is large relative to total execution time and very little of this cost is hidden. The development of an a-priori criterion that captures the extent of overlap would streamline model selection and improve the predictiveness of our evaluation models. We leave this for future work.

As mentioned in Section 7.2, the least squares residual can also serve as an indicator of model appropriateness, aiding in the selection of a linear or nonlinear model.
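The first strategy above amounts to a simple comparison, sketched below; this is our own illustration, and the function name and tolerance are arbitrary.

    def suggests_overlap(t_total, t_gmem_only, c_onchip_est, tol=0.15):
        """Heuristic from the first strategy above: if the measured time of
        the work-removed (global-memory-only) kernel plus the estimated
        on-chip cost clearly exceeds the measured total time, on-chip costs
        are being hidden."""
        return (t_gmem_only + c_onchip_est) > (1.0 + tol) * t_total

    # Example: 3 ms total, 2.5 ms with on-chip work removed, ~1.5 ms of
    # estimated on-chip work -> 4.0 ms >> 3 ms, so overlap is likely and
    # the nonlinear model (8) is the better candidate.
    print(suggests_overlap(3.0e-3, 2.5e-3, 1.5e-3))  # True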
The measurement kernel sets used to calibrate our models, unlike the set used in the simple example in Section 2.2, employ a microbenchmarking approach and do not include the computation whose execution times we are predicting. Each microbenchmark kernel is designed to reveal the cost associated with a single kernel feature. The design of these measurement kernels was outlined in Section 7.1.2. We use the following notation when referring to features:

    f-mem/op type {lid strides}{gid strides}<data type>[AFR](memory access tag)

To denote a measurement kernel exercising a particular feature, we substitute the prefix k- for the prefix f-. The mem/op type field categorizes the feature as a global or local memory access, arithmetic operation, local barrier, kernel launch, or work-group launch. For data motion features, the stride, data type, and AFR fields describe the access pattern characteristics introduced in Section 6.1.1. The memory access tag field describes a memory access specified by a memory access tag, as described in Sections 6.1.1 and 7.1.2. Figure 6 shows which measurement kernels we use to calibrate each model, as well as the features present in each kernel.

Our first evaluation model predicts execution times for two variants of square matrix-matrix multiplication. The first variant, which is often used as an introductory example in teaching GPU programming, is the same as the one discussed in Section 2.1. It prefetches tiles of the two input matrices into local (shared) memory before performing arithmetic. The second is the same algorithm without any prefetching, and without splitting of the k summation loop. The prefetching variant achieves between 8% and 20% of peak FLOP/s rates on all five GPUs. It is important to clarify the relationship between these measurements and the validity assumptions (in particular, on machine utilization) set forth in Section 4. The utilization assumption does not entail that all kernels to which our modeling approach is applicable must already achieve peak performance, as this would not reflect the typical use case of performance modeling. Instead, the assumption exists to highlight potential sources of model inaccuracy or performance shortfall, as revealed by the model.

Both variants of our matrix-matrix multiplication evaluation case operate on 32-bit floating point data and use 16 × 16 work-groups. Together, the two variants use five distinct global memory access patterns, as shown in Figure 6b. We model execution time using the nonlinear model expressed in (8). The model features comprising c_gmem, c_on-chip, and c_overhead, as well as the measurement kernel set, are shown in Figure 6.

Table 3 displays the model parameter values representing feature costs on the Nvidia Titan V GPU, as well as p_edge from (6), whose value is determined along with the feature cost parameters during the calibration process. Recall that these values aim to represent effective costs at maximum throughput. The units of work whose cost we model are determined by the modeled cost granularities (MCGs) listed; each operation cost modeled is assessed per work-item (WI), sub-group (SG), work-group (WG), or kernel (K). We introduced these OpenCL machine model concepts in Section 1.2, and discussed their relevance to operation counting in Section 5. Note that we count f-gmem(mm-noPF-a) once per sub-group (32 work-items) rather than once per work-item because the lid(0) stride is 0, as discussed in Section 5.

Table 3. Matrix multiplication model parameter values on the Nvidia Titan V GPU. Parameter values represent costs per unit according to the granularities listed, with the exception of 'overlap edge', the parameter governing the sharpness of our step function approximation, p_edge in Equation 6. Modeled cost granularity (MCG): work-item (WI), sub-group (SG), work-group (WG), or kernel (K). Peak values obtained from Wikipedia contributors (2019).
The table also displays a throughput calculated based on each parameter value and corresponding MCG. In this example set of parameter values, we observe similar costs for addition, multiplication, and multiply-add operations, as we expect, and that the local memory access cost is about twice that of the arithmetic operations. We also observe that the throughputs implied by the arithmetic operation costs match the peak 32-bit flop rate for the hardware very closely. Note that the peak hardware flop rate listed here assumes multiply-add operations are counted as two operations.

The accessibility of these parameter values and the transparency of the costs they represent can facilitate understanding of the factors affecting performance in these kernels. For example, by comparing the data throughput for mm-PF-a and mm-PF-b to the throughputs for mm-noPF-a and mm-noPF-b, we observe that prefetching increases the effective cost per item loaded from global memory by factors of about 3 and 5, respectively. This suggests that the overall cost savings due to prefetching are primarily due to the reduction in the total number of global memory accesses by a factor of the tile width, 16, rather than to a reduction in the cost of individual memory accesses.

Since the access-to-footprint ratios for matrices a and b are significantly greater than one (n, for the non-prefetching case), it is not unreasonable that the apparent data throughput is greater than the hardware peak. Elements of these arrays may be reused rather than re-fetched from global memory due to, e.g., hardware caching, inflating the calculated throughput.

Figure 6. Measurement kernels used to calibrate each model (Matmul, DG, FD) and the features present in each kernel, including the k-gmem access-pattern kernels.

These results suggest that the prefetching variant, which performs 16 local memory stores that are not performed in the non-prefetching variant and 16 times fewer global memory loads, successfully hides the cost of local memory transactions and arithmetic operations behind global memory transactions, and that the nonlinear model is a good choice for this computation on these architectures. Results of the on-chip-cost-hiding analysis described in Section 8.1 are consistent with this interpretation.

The Discontinuous Galerkin (DG) finite element method is a numerical method for the approximate, high-order accurate solution of boundary value problems of wave-like (hyperbolic) PDEs such as Euler's or Maxwell's equations, often used in complex geometry on unstructured meshes. Our second evaluation case models execution times for four variants of an element-wise differentiation of per-element polynomials used in a DG computation. The pre-transform Loopy kernel shows the mathematical operation performed by all of these variants:

    knl = lp.make_kernel("{[m,k,i,j]: 0<=m ...

For a single differentiation matrix diff_mat (i.e., nmatrices = 1), this operation can be viewed as a matrix-matrix multiplication in which a small square matrix is multiplied by a 'short and wide' element matrix u.
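Since the kernel listing above is truncated, here is a plausible numpy rendering of the operation it describes; the shapes are inferred from the surrounding text, and the index names m, i, j, k follow the Loopy domain.

    import numpy as np

    # res[m, i, k] = sum_j diff_mat[m, i, j] * u[j, k]
    # (nmatrices differentiation matrices of size nunit_nodes x nunit_nodes
    # applied to a 'short and wide' element matrix u; nelements varies in
    # the experiments, 4096 is only an example value)
    nmatrices, nunit_nodes, nelements = 3, 64, 4096
    diff_mat = np.random.rand(nmatrices, nunit_nodes, nunit_nodes).astype(np.float32)
    u = np.random.rand(nunit_nodes, nelements).astype(np.float32)
    res = np.einsum("mij,jk->mik", diff_mat, u)
    print(res.shape)   # (3, 64, 4096)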
The primary differences between this computation and that of the previous example are (1) that, typically, nunit_nodes << nelements, yielding small diff_mat matrices and a 'short and wide' u, and (2) that, since nmatrices > 1, the operation on u is performed multiple times, providing opportunities for data reuse.

All four variants tile and parallelize the k and i loops:

    knl = lp.split_iname(knl, "i", lsize[1],
        outer_tag="g.1", inner_tag="l.1")
    knl = lp.split_iname(knl, "k", lsize[0],
        outer_tag="g.0", inner_tag="l.0")

The first variant performs only these transformations and does not utilize local memory for data reuse. The second variant prefetches lsize[0] × lsize[1] (16 × 16) tiles from u into local (shared) memory before performing arithmetic:

    ...
    knl = lp.split_iname(knl, "j", lsize[1])
    knl = lp.add_prefetch(knl, "u", ["k_in", "j_in"])
    knl = lp.fix_parameters(knl, nmatrices=nmatrices)
    knl = lp.add_inames_to_insn(knl, "i_out", "id:*fetch*")
    knl = lp.realize_reduction(knl)
    knl = lp.privatize_temporaries_with_inames(knl, "m")
    knl = lp.duplicate_inames(knl, "m", "id:*init*")
    knl = lp.duplicate_inames(knl, "m", "writes:res")
    knl = lp.prioritize_loops(knl, ["j_out", "j_in", "m"])

These additional transformations tile the j loop, load parts of u into scratchpad (local) memory, and restructure the loops to expose instruction-level parallelism.

The third variant instead prefetches lsize[0] × lsize[1] tiles from the differentiation matrix diff_mat into local memory before performing arithmetic:

    ...
    knl = lp.split_iname(knl, "j", lsize[0])
    knl = lp.add_prefetch(knl, "diff_mat", ["j_in", "i_in"])
    knl = lp.prioritize_loops(knl, ["m", "j_out", "j_in"])

The fourth variant uses the same transformations as the third; however, we also transpose the memory layout of the element data:

    knl = lp.tag_data_axes(knl, "u", "N0,N1")
    knl = lp.tag_data_axes(knl, "res", "N2,N0,N1")

These transformations change the nesting order of the data axes, N0 and N1, which changes the global memory access patterns for u and res so that the stride of lid(0) is 1 instead of nunit_nodes. This significantly improves the performance of these loads. As a result, the last variant is the fastest in all our measurements, and achieves between 5% and 18% of peak FLOP/s rates on all five GPUs.

All four variants operate on 32-bit floating point data and use 16 × 16 work-groups. We set nmatrices to 3 and nunit_nodes to 64, and nelements varies as shown in Figure 8. The four variants and the measurement kernel set used to calibrate their model use 11 distinct global memory access patterns, as shown in Figure 6b.

To decide whether to use a model that allows for on-chip cost hiding, we apply the analysis described in Section 8.1 to each of the DG variants. The results suggest that on-chip work overlaps with global memory transactions, with one exception: the u-prefetching variant does not exhibit this overlap on the Nvidia Titan V, Nvidia Tesla K40c, and Nvidia Tesla C2070 GPUs. On these three GPUs, the total execution time for the u-prefetching variant is approximately the sum of the on-chip and global memory operation costs. Because of this, we use the linear model expressed in (7) to model the u-prefetching variant on these three GPUs; in all other cases, we model the DG variants using the nonlinear model expressed in (8). The model features comprising c_gmem, c_on-chip, and c_overhead, as well as the measurement kernel set, are shown in Figure 6.
Recall that in Section 7.4 we observed that the kernel which allowed us to vary the ratio of local to global memory accesses also did not exhibit overlap on the Nvidia Tesla K40c and Nvidia Tesla C2070 GPUs.

Figure 7. Matrix multiplication model accuracy; measured and predicted execution times shown for the prefetching and non-prefetching variants, with n (matrix width) ranging over 768-2304. Geometric mean of relative error: Volta Titan V 7.3%, GTX Titan X 3.1%, Tesla K40c 1.1%, Tesla C2070 4.6%, Radeon R9 Fury 11.5%.

Figure 8 compares modeled to measured execution times for the four variants on five GPUs, and displays the relative error in model predictions. Across all cases, the geometric mean of relative error is 7.5%. On all four Nvidia GPUs, the model predictions are sufficient to accurately rank execution times for all variants from highest to lowest. On the AMD Radeon R9 Fury GPU, while the model predictions would rank the two fastest variants in reverse order, the predicted execution times are accurate enough to narrow the space of potential variants to the two fastest options, whose execution times differ by less than 7%. Additionally, the predictions accurately reveal the cost savings realized by the diff_mat-prefetching variant when operating on element data with a transposed memory layout.

In our third evaluation case, we model execution times for two variants of a 2-D five-point finite difference stencil operation. The pre-transform Loopy kernel shows the mathematical operation carried out by both transform variants:

    knl = lp.make_kernel("{[i,j]: 0<=i,j ...

Both variants parallelize the i and j indices across threads and prefetch lsize[0] × lsize[1] tiles from u into scratchpad (local) memory before performing arithmetic:

    knl = lp.split_iname(knl, "i", lsize[1]-2,
        outer_tag="g.1", inner_tag="l.1")
    knl = lp.split_iname(knl, "j", lsize[0]-2,
        outer_tag="g.0", inner_tag="l.0")
    knl = lp.add_prefetch(knl, "u", ["i_in", "j_in"],
        fetch_bounding_box=True)
    knl = lp.tag_inames(knl, "u_dim_0:l.1, u_dim_1:l.0")

The difference between the two variants is the work-group size, which is also the size of the tiles fetched into local memory. This change substantially affects the global memory access patterns. For the first variant, we use 16 × 16 work-groups, and for the second, we use 18 × 18. With 16 × 16 work-groups, each of the 14 × 14 elements in a tile is computed by one of 196 threads, while 60 threads remain idle, corresponding to the 60 halo elements. This strategy yields a gid(0) stride of 14. With 18 × 18 work-groups, each of the 16 × 16 elements in a tile is computed by one of 256 threads, while 68 threads remain idle, corresponding to the 68 halo elements. This strategy yields a gid(0) stride of 16.
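The tile/halo arithmetic behind these thread counts, as a small check of the figures quoted above:

    # For a work-group of wg x wg threads, the stencil tile it can update is
    # the (wg-2) x (wg-2) interior; the remaining threads sit on the halo.
    for wg in (16, 18):
        active = (wg - 2) ** 2
        idle = wg * wg - active
        print(f"{wg}x{wg}: {active} active threads, {idle} idle (halo)")
    # 16x16: 196 active threads, 60 idle (halo)
    # 18x18: 256 active threads, 68 idle (halo)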
Unlike the other variants we model, the global memory loads in these kernels have access-to-footprint ratios near one. Because of this, the data throughput, calculated as (total global memory access count)/(execution time), is less likely to be inflated significantly by cached data reuse, and is meaningful to report. We observe that the 16 × 16 variant is slightly faster, and achieves between 40% and 82% of peak bandwidth on all five GPUs. The effective FLOP/s rates are between 2% and 5% of peak. The analysis of on-chip cost hiding that we described in Section 8.1 indicates that little if any such overlap occurs when executing these variants on the architectures used in our experiments. Because of this, we model execution time using the linear model in (7).

The model features comprising c_gmem, c_on-chip, and c_overhead, as well as the measurement kernel set, are shown in Figure 6. As shown in Figure 6b, the variants and the measurement kernels used to calibrate the model are based on five distinct global memory access patterns. Both variants operate on 32-bit floating point data.

We note two potential sources of modeling error in this example. As mentioned in Section 7.1.2, the models presented here use a single feature to represent all local memory accesses. We made this decision to simplify the models and measurement kernel sets; it is not a limitation of our approach or our software, since local memory access features may include the same access pattern characteristics as global memory access features. The local memory accesses in these kernels constitute a significant portion of the execution time, 10-20%, and have different access patterns across variants. The lid(0) stride is one in both variants; however, the lid(1) stride is 18 in the 18 × 18 variant and 16 in the 16 × 16 variant.

Figure 8. DG differentiation model accuracy; measured and predicted execution times shown for the no-prefetch, diff_mat-prefetching, u-prefetching, and transposed diff_mat-prefetching variants, with the linear model used where indicated (otherwise nonlinear). Geometric mean of relative error: Volta Titan V 5.9%, GTX Titan X 10.5%, Tesla K40c 7.7%, Tesla C2070 11.3%, Radeon R9 Fury 4.5%.

9 Related work

Common approaches to cost prediction include analytical modeling (often based on in-depth program and hardware analysis), statistical regression and machine learning approaches, and machine and application benchmarking. Many approaches use a combination of these strategies.

Among analytical approaches to GPU performance modeling, much of the previous work yielding the most accurate predictions has focused on constructing models of instruction-level execution based on detailed hardware knowledge and instruction analysis, often for a single architecture or group of highly similar architectures. Many of these models predict well for their specific target architecture. For example, Hong and Kim (2009) present an analytical performance model for Nvidia GPU architectures that produces an execution time prediction based on estimates of memory-level and thread-level parallelism. They further extend their model for power prediction (Hong and Kim 2010). This model achieves a geometric mean error of 13.3% when predicting performance of the MERGE (Linderman et al. 2008) benchmarks on four Tesla-generation Nvidia GPUs. It makes extensive use of hardware performance characteristics, such as timing delays between memory transactions, DRAM access latency, and instruction execution cycles, and it requires an analysis of PTX assembly instructions. Baghsorkhi et al. (2010) also use deep analytical knowledge of a (single) GPU architecture, and, unlike Hong and Kim (2009), model branch divergence, bank conflicts, and SIMD pipeline delays.

Spafford and Vetter (2012) approach analytical modeling of exascale applications by introducing a domain-specific language for defining abstract application and machine models. To construct an application model, users provide a structured description of application characteristics, including parameterized expressions for flop counts, load counts, communication volume and type (e.g., all-to-all vs. allgather), and the number of parallel units, as well as information on control flow.
To construct an abstract machine model, users provide a structured description of machine characteristics, including memory clock rate, bus width, core clock rate, SIMD width, and presence of an FMA instruction, as well as information on the machine interconnect. These abstract models can then be consumed by other applications to analyze the code and its performance.

Figure 9. Finite difference model accuracy; measured and predicted execution times shown for both stencil variants. Geometric mean of relative error: Volta Titan V 4.7%, GTX Titan X 11.1%, Tesla K40c 7.1%, Tesla C2070 6.4%, Radeon R9 Fury 4.5%.

Other related analytical modeling contributions include works by Hammer et al. (2017), who use a partially automated analytical approach to modeling CPU loop kernel performance that allows for multiple architectures; Van Gemund (2003), who uses a partially automated symbolic analytic modeling approach to predict performance on distributed CPU machines; Pllana and Fahringer (2005), who provide a graphical user interface to aid distributed CPU model creation and employ discrete event simulation; and Unat et al. (2015), who introduce a tool employing compiler analysis to generate parameterized models with the slightly different goal of evaluating design trade-offs and software optimizations. All four of these analytical approaches require a user-supplied machine model or architecture statistics.

Machine learning and statistical techniques are also used to predict performance of GPU kernels. From the perspective of optimization selection, Cavazos et al. (2006) present a probabilistic predictor of transformation selection using a non-analytical, black-box model based on an artificial neural network. Joseph et al. (2006) use techniques from machine learning to identify piecewise nonlinearities in cost metrics. Other approaches emphasize the performance of single subsystems, such as branch prediction (Emer et al. 2002). Other learning and statistical approaches include Jia et al. (2012), Kerr et al. (2012), Wu et al. (2015), Zhang et al. (2011), Chen et al. (2018), and Gysi et al. (2019).

Some modeling approaches employ benchmarks, including Zhang and Owens (2011), who use results from microbenchmarks to derive a throughput model for instruction pipeline, shared memory, and global memory costs. They focus on identifying performance bottlenecks and guiding the optimization process rather than predicting execution time. The target kernel must be run in a simulator to gather relevant performance counters, and the binary file must be analyzed. Johnston et al. (2018) gather their set of architecture-independent program features by simulating an OpenCL device using the Oclgrind simulator and examining the LLVM intermediate representation produced. They then use these features to generate a random forest model. Konstantinidis and Cotronis (2017) employ a microbenchmarking approach to gather GPU performance metrics, and gather information about their target kernels using Nvidia's nvprof profiler on a reference GPU. They show results on a larger kernel set than most works we found, achieving an average error of 27.66% (geometric mean 18%, not reported).

Two recent articles survey the current GPU performance modeling landscape. In Madougou et al. (2016), researchers perform an in-depth evaluation of 12 GPGPU performance modeling tools, including 6 analytical models: Hong and Kim (2009), Kothapalli et al. (2009), Li et al. (2015), Meswani et al.
(2013), Sim et al. (2012), and Song et al. (2013). They determine that, while the analytical models tend to be accurate for a particular hardware family or workload, they are less accurate for different hardware. They report that constructing all of these models requires significant effort and the collection or estimation of anywhere from 6 to 30 parameters characterizing the hardware or the application, which is consistent with our experience reproducing the results in Hong and Kim (2009). Lopez-Novoa et al. (2015) come to similar conclusions after surveying over 30 models for GPU computing of various types, including 7 general-purpose execution-time-estimating models, i.e., models that are not designed for a particular application (Che and Skadron 2014; García 2013; Hong and Kim 2009; Kerr et al. 2010; Kothapalli et al. 2009; Meng et al. 2011; Nugteren and Corporaal 2012). They conclude that there is no accurate model valid for a wide set of architectures, and that each model they consider makes a trade-off between accuracy and breadth of hardware applicability. They also note that most models they surveyed are designed for CUDA rather than vendor-neutral OpenCL, and that Hong and Kim (2009) stands out as the model of choice for accurately predicting GPU execution times.

10 Conclusions

We have demonstrated an alternative to previous GPU performance modeling approaches: a framework for constructing analytical models and calibrating them to a GPU using a customized measurement kernel set. Our framework allows a developer to control the trade-offs between model accuracy, complexity, generalization, and evaluation speed, and our hardware-blind model-calibration approach allows these models to make predictions on new devices with minimal effort. We demonstrate example execution time models for three workloads yielding predictions accurate enough to, e.g., allow an autotuner or human user to identify which kernel variant or subset of variants will have the shortest execution times. Across all variants of all three computations on all five GPUs, we achieved a geometric mean relative error of 6.4%. Additionally, we show how the transparency and interpretability of the model expressions, parameters, and features enables users to gain actionable insights into the factors affecting computational cost.

Acknowledgements

The authors would like to thank Dr. Bill Gropp for helpful discussions and feedback on an earlier version of the manuscript, and express appreciation to Matt Wala for significant contributions to our underlying transformation engine Loopy and unfailing readiness to assist. The authors' work was supported in part by US Navy ONR grant number N00014-14-1-0117, and by the National Science Foundation under grant numbers DMS-1418961 and CCF-1524433. We also gratefully acknowledge a hardware gift from Nvidia Corporation. Opinions expressed herein are those of the authors and in no way reflect the official position of any of the funding agencies.

References

Baghsorkhi SS, Delahaye M, Patel SJ, Gropp WD and Hwu WmW (2010) An Adaptive Performance Modeling Tool for GPU Architectures. In: Proceedings of the 15th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPoPP '10. New York, NY, USA: ACM. ISBN 978-1-60558-877-3, pp. 105-114. DOI: 10.1145/1693453.1693470.

Barvinok AI (1994) A polynomial time algorithm for counting integral points in polyhedra when the dimension is fixed.
Mathematics of Operations Research.

Cavazos J et al. (2006) In: Proceedings of the 2006 International Conference on Compilers, Architecture and Synthesis for Embedded Systems. ACM, pp. 24-34. DOI: 10.1145/1176760.1176765.

Che S and Skadron K (2014) Benchfriend: Correlating the performance of GPU benchmarks. The International Journal of High Performance Computing Applications.

Chen et al. (2018) In: Advances in Neural Information Processing Systems. pp. 3393-3404.

Danalis A, Marin G, McCurdy C, Meredith JS, Roth PC, Spafford K, Tipparaju V and Vetter JS (2010) The Scalable Heterogeneous Computing (SHOC) benchmark suite. In: Proceedings of the 3rd Workshop on General-Purpose Computation on Graphics Processing Units. ACM, pp. 63-74. DOI: 10.1145/1735688.1735702.

Emer J, Ahuja P, Borch E, Klauser A, Luk CK, Manne S, Mukherjee SS, Patil H, Wallace S, Binkert N and others (2002) Asim: A performance model framework. Computer.

Fleming PJ and Wallace JJ (1986) How not to lie with statistics: The correct way to summarize benchmark results. Commun. ACM.

García (2013) Modelo de estimación de rendimiento para arquitecturas paralelas heterogéneas (A performance estimation model for heterogeneous parallel architectures). Master's Thesis, Univ. Politecnica de Valencia.

Gysi T, Grosser T and Hoefler T (2019) Absinthe: Learning an analytical performance model to fuse and tile stencil codes in one shot. In: Proceedings of the 28th International Conference on Parallel Architectures and Compilation Techniques, PACT '19.

Hammer J, Eitzinger J, Hager G and Wellein G (2017) Kerncraft: A tool for analytic performance modeling of loop kernels. CoRR abs/1702.04653.

Hong S and Kim H (2009) An Analytical Model for a GPU Architecture with Memory-level and Thread-level Parallelism Awareness. In: Proceedings of the 36th Annual International Symposium on Computer Architecture, ISCA '09. New York, NY, USA: ACM. ISBN 978-1-60558-526-0, pp. 152-163. DOI: 10.1145/1555754.1555775.

Hong S and Kim H (2010) An integrated GPU power and performance model. In: ACM SIGARCH Computer Architecture News, volume 38. ACM, pp. 280-289. DOI: 10.1145/1815961.1815998.

Jia W, Shaw KA and Martonosi M (2012) Stargazer: Automated regression-based GPU design space exploration. In: IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS). pp. 2-13. DOI: 10.1109/ISPASS.2012.6189201.

Johnston B, Falzon G and Milthorpe J (2018) OpenCL performance prediction using architecture-independent features. CoRR abs/1811.00156.

Joseph PJ, Vaswani K and Thazhuthaveetil MJ (2006) A predictive performance model for superscalar processors. In: Proceedings of the 39th Annual IEEE/ACM International Symposium on Microarchitecture. IEEE Computer Society, pp. 161-170. DOI: 10.1109/MICRO.2006.6.

Kerr A, Anger E, Hendry G and Yalamanchili S (2012) Eiger: A framework for the automated synthesis of statistical performance models. In: IEEE International Conference on High Performance Computing (HiPC). pp. 1-6. DOI: 10.1109/HiPC.2012.6507525.

Kerr A, Diamos G and Yalamanchili S (2010) Modeling GPU-CPU workloads and systems. In: Proceedings of the 3rd Workshop on General-Purpose Computation on Graphics Processing Units, GPGPU-3. New York, NY, USA: ACM. ISBN 978-1-60558-935-0, pp. 31-42. DOI: 10.1145/1735688.1735696.

Klöckner A, Wilcox LC and Warburton T (2016) Array program transformation with loo.py by example: High-order finite elements. In: Proceedings of the 3rd ACM SIGPLAN International Workshop on Libraries, Languages, and Compilers for Array Programming, ARRAY 2016. New York, NY, USA: ACM. ISBN 978-1-4503-4384-8, pp. 9-16. DOI: 10.1145/2935323.2935325.

Klöckner A (2014) Loo.Py: Transformation-based Code Generation for GPUs and CPUs. In: Proceedings of ACM SIGPLAN International Workshop on Libraries, Languages, and Compilers for Array Programming, ARRAY '14. New York, NY, USA: ACM. ISBN 978-1-4503-2937-8, pp. 82:82-82:87.
DOI: 10.1145/2627373.2627387.

Klöckner A (2015) Loo.Py: From Fortran to Performance via Transformation and Substitution Rules. In: Proceedings of the 2nd ACM SIGPLAN International Workshop on Libraries, Languages, and Compilers for Array Programming, ARRAY 2015. New York, NY, USA: ACM. ISBN 978-1-4503-3584-3, pp. 1-6. DOI: 10.1145/2774959.2774969.

Konstantinidis E and Cotronis Y (2017) A quantitative roofline model for GPU kernel performance estimation using micro-benchmarks and hardware metric profiling. Journal of Parallel and Distributed Computing.

Kothapalli K et al. (2009) In: International Conference on High Performance Computing (HiPC). pp. 463-472. DOI: 10.1109/HIPC.2009.5433179.

Li A, Tay Y, Kumar A and Corporaal H (2015) Transit: A visual analytical model for multithreaded machines. In: Proceedings of the 24th International Symposium on High-Performance Parallel and Distributed Computing, HPDC '15. New York, NY, USA: ACM. ISBN 978-1-4503-3550-8, pp. 101-106. DOI: 10.1145/2749246.2749265.

Linderman MD, Collins JD, Wang H and Meng TH (2008) Merge: A Programming Model for Heterogeneous Multi-core Systems. In: Proceedings of the 13th International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS XIII. New York, NY, USA: ACM. ISBN 978-1-59593-958-6, pp. 287-296. DOI: 10.1145/1346281.1346318.

Lopez-Novoa U, Mendiburu A and Miguel-Alonso J (2015) A survey of performance modeling and simulation techniques for accelerator-based computing. IEEE Transactions on Parallel and Distributed Systems.

Madougou S et al. (2016) The landscape of GPGPU performance modeling tools. Parallel Computing 56: 18-33. DOI: 10.1016/j.parco.2016.04.002.

Meng J, Morozov VA, Kumaran K, Vishwanath V and Uram TD (2011) GROPHECY: GPU performance projection from CPU code skeletons. In: Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis, SC '11. ACM. ISBN 978-1-4503-0771-0, pp. 14:1-14:11. DOI: 10.1145/2063384.2063402.

Meswani MR, Carrington L, Unat D, Snavely A, Baden S and Poole S (2013) Modeling and predicting performance of high performance computing applications on hardware accelerators. The International Journal of High Performance Computing Applications.

Munshi A, Gaster BR, Mattson TG, Fung J and Ginsburg D (2011) OpenCL Programming Guide. Pearson Education.

Nickolls J, Buck I, Garland M and Skadron K (2008) Scalable parallel programming with CUDA. Queue.

Nugteren C and Corporaal H (2012) In: Proceedings of the 9th Conference on Computing Frontiers, CF '12. New York, NY, USA: ACM. ISBN 978-1-4503-1215-8, pp. 203-212. DOI: 10.1145/2212908.2212937.

Pllana S and Fahringer T (2005) Performance Prophet: A performance modeling and prediction tool for parallel and distributed programs. In: International Conference on Parallel Processing Workshops. IEEE, pp. 509-516. DOI: 10.1109/ICPPW.2005.72.

Sim J, Dasgupta A, Kim H and Vuduc R (2012) A performance analysis framework for identifying potential benefits in GPGPU applications. SIGPLAN Not.

Song S et al. (2013) In: IEEE International Symposium on Parallel and Distributed Processing (IPDPS). pp. 673-686. DOI: 10.1109/IPDPS.2013.73.

Spafford KL and Vetter JS (2012) Aspen: A domain specific language for performance modeling. In: Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis, SC '12. Los Alamitos, CA, USA: IEEE Computer Society Press. ISBN 978-1-4673-0804-5, pp. 84:1-84:11. DOI: 10.1109/SC.2012.20.

Treibig J and Hager G (2010) Introducing a Performance Model for Bandwidth-Limited Loop Kernels. In: Wyrzykowski R, Dongarra J, Karczewski K and Wasniewski J (eds.) Parallel Processing and Applied Mathematics, Lecture Notes in Computer Science. Springer Berlin Heidelberg. ISBN 978-3-642-14390-8, pp. 615-624. DOI: 10.1007/978-3-642-14390-8_64.

Unat D, Chan C, Zhang W, Williams S, Bachan J, Bell J and Shalf J (2015) ExaSAT: An exascale co-design tool for performance modeling.
The International Journal of High Performance Computing Applications.

Van Gemund AJC (2003) Symbolic performance modeling of parallel systems. IEEE Transactions on Parallel and Distributed Systems.

Verdoolaege S (2010) isl: An integer set library for the polyhedral model. In: Mathematical Software - ICMS 2010. Springer, pp. 299-302. DOI: 10.1007/978-3-642-15582-6_49.

Verdoolaege S and Grosser T (2012) Polyhedral extraction tool. In: Second International Workshop on Polyhedral Compilation Techniques (IMPACT '12), Paris, France. pp. 1-16. DOI: 10.13140/RG.2.1.4213.4562.

Verdoolaege S, Seghir R, Beyls K, Loechner V and Bruynooghe M (2007) Counting integer points in parametric polytopes using Barvinok's rational functions. Algorithmica.

Wikipedia contributors (2019) List of Nvidia graphics processing units. https://en.wikipedia.org/w/index.php?title=List_of_Nvidia_graphics_processing_units&oldid=923612324. [Online; accessed 1-November-2019].

Wu G, Greathouse JL, Lyashevsky A, Jayasena N and Chiou D (2015) GPGPU performance and power estimation using machine learning. In: IEEE International Symposium on High Performance Computer Architecture (HPCA). pp. 564-576. DOI: 10.1109/HPCA.2015.7056063.

Zhang Y, Hu Y, Li B and Peng L (2011) Performance and power analysis of ATI GPU: A statistical approach. In: IEEE International Conference on Networking, Architecture, and Storage (NAS). pp. 149-158. DOI: 10.1109/NAS.2011.51.

Zhang Y and Owens JD (2011) A quantitative performance analysis model for GPU architectures. In: IEEE International Symposium on High Performance Computer Architecture (HPCA). pp. 382-393. DOI: 10.1109/HPCA.2011.5749745.