Using hardware performance counters to speed up autotuning convergence on GPUs
Jiří Filipovič∗, Jana Hozzová, Amin Nezarat, Jaroslav Oľha, Filip Petrovič
Institute of Computer Science, Masaryk University, Botanická 68a, 60200 Brno, Czech Republic
Abstract
Nowadays, GPU accelerators are commonly used to speed up general-purpose computing tasks on a variety of hardware. However, due to the diversity of GPU architectures and processed data, optimization of codes for a particular type of hardware and specific data characteristics can be extremely challenging. The autotuning of performance-relevant source-code parameters allows for automatic optimization of applications and keeps their performance portable. Although the autotuning process typically results in code speed-up, searching the tuning space can bring unacceptable overhead if (i) the tuning space is vast and full of poorly-performing implementations, or (ii) the autotuning process has to be repeated frequently because of changes in processed data or migration to different hardware.
In this paper, we introduce a novel method for searching tuning spaces. The method takes advantage of collecting hardware performance counters (also known as profiling counters) during empirical tuning. Those counters are used to navigate the searching process towards faster implementations. The method requires the tuning space to be sampled on any GPU. It builds a problem-specific model, which can be used during autotuning on various, even previously unseen inputs or GPUs. Using a set of five benchmarks, we experimentally demonstrate that our method can speed up autotuning when an application needs to be ported to different hardware or when it needs to process data with different characteristics. We also compared our method to the state of the art and show that our method is superior in terms of the number of searching steps and typically outperforms other searches in terms of convergence time.
Keywords: auto-tuning, search method, performance counters, CUDA
1. Introduction
In the recent decade, GPU accelerators have been used to speed up non-graphical applications on multiple classes of devices – from portable devices to supercomputers. Different models of GPU hardware vary significantly in terms of their architecture, even when manufactured by the same vendor. For example, there are eight generations of NVIDIA CUDA GPUs available, differing in the number of cores per multiprocessor, the presence of read/write L1 cache, the number of registers and more. Moreover, each generation contains a broad range of GPU models with different floating-point performance and memory bandwidth. As the hardware characteristics heavily influence the performance of a given GPU kernel, its code needs to be adapted for each GPU model to achieve optimal performance – otherwise, performance portability is not ensured [20, 25, 27]. Furthermore, kernels' performance is also sensitive to input size, structure, or application settings, so a code optimized for certain input characteristics may run suboptimally when those change [15, 24, 35].
∗ Corresponding author
Email addresses: [email protected] (Jiří Filipovič), [email protected] (Jana Hozzová), [email protected] (Amin Nezarat), (Jaroslav Oľha), [email protected] (Filip Petrovič)
Autotuning is a method which allows codes to be automatically adjusted to given hardware and input [5]. During GPU kernel development, programmers define tuning parameters – the properties of the code which influence application performance. Each tuning parameter can take one of the pre-defined discrete values. The cross product of tuning parameters (potentially pruned by a priori known constraints) forms a tuning space. One point in the tuning space, which defines how the computational kernel is created and executed, is called a tuning configuration. An autotuning framework searches the tuning space for a tuning configuration that maximizes a tuning objective, usually runtime or power consumption [5].
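For illustration, the construction of a tuning space can be sketched in a few lines of Python; the parameter names and the constraint below are hypothetical, not taken from any benchmark in this paper:

    import itertools

    # Hypothetical tuning parameters; each takes one of a few discrete values.
    tuning_parameters = {
        "BLOCK_SIZE": [32, 64, 128, 256],
        "VECTOR_WIDTH": [1, 2, 4],
        "USE_SHARED_MEMORY": [0, 1],
    }

    # A priori known constraint (hypothetical): vectorized loads are only
    # implemented in the shared-memory variant of the kernel.
    def is_valid(config):
        return config["VECTOR_WIDTH"] == 1 or config["USE_SHARED_MEMORY"] == 1

    # The tuning space is the cross product of parameter values pruned by the
    # constraint; each element is one tuning configuration.
    names = list(tuning_parameters)
    tuning_space = [config for config in
                    (dict(zip(names, values))
                     for values in itertools.product(*tuning_parameters.values()))
                    if is_valid(config)]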
Autotuning spaces have several properties that make their efficient search difficult. These discrete optimization spaces with many dimensions are known to be non-convex, non-linear and with low locality [6]. The time needed for tuning space search can limit the practical usage of autotuning. This happens especially in two cases: (i) when the tuning space is vast, and most of its configurations perform poorly, thus the search takes long; (ii) when the performance depends significantly on input characteristics (e. g., size), which often change, thus the search needs to happen often.
State-of-the-art methods for searching generic tuning spaces (i. e., including any tuning parameters) view their objective as a function of tuning parameters. They are based on mathematical optimization [6, 39], or they use a surrogate performance/power model built from a sample of tuning space [18, 28, 10, 11]. Because the function relating tuning parameters with the objective differs with hardware and input, those methods require the autotuning to be repeated from scratch when hardware or input changes.
In this paper, we introduce a novel method for searching through autotuning spaces, which leverages historical tuning data to speed up the tuning process when input or hardware changes. Our method mimics the iterative optimization performed by developers. Developers use profilers to collect performance counters and identify bottlenecks on multiple levels of the hardware hierarchy. They also understand how the properties of their code (i. e., tuning parameters) are related to performance counters. They iteratively detect bottlenecks and modify the code to reduce stress on the bottlenecks until they reach sufficient code performance. Our method aims to do the same. It requires a training phase, wherein the model is trained to capture how tuning parameters influence performance counters (a machine-learning based analogy to a developer's understanding of the relationship). During autotuning, the method iteratively profiles the code and acquires performance counters, analyzes bottlenecks and determines which performance counters should be changed to soften the bottlenecks using an expert system (an analogy to a developer's work with a profiling tool). Then, it uses the model to determine which tuning configurations change performance counters in the required way. Finally, it selects the next tuning configuration to profile (an analogy to a developer's modification of the code).
The strength of our method is its ability to build a model using a particular GPU and input, and use this model to speed up autotuning of a kernel running on a different GPU or processing different input. This is possible because the method builds the model of relations between tuning parameters and performance counters and deduces the performance from performance counters, instead of relating tuning parameters directly to the performance, as is done in [6, 39, 18, 28, 10, 11]. Compared to the performance itself, the tuning parameters affect performance counters in a more straightforward and stable way (here, we consider performance counters quantifying the number of operations on different GPU subsystems; another type of performance counter, quantifying the relative occupancy of GPU subsystems, strongly depends on GPU architecture and input size). This is illustrated in Figure 1 for the Coulomb sum benchmark (described in the next section of this paper). In this example, the relation of the depicted tuning parameter to kernel runtime changes significantly with different input and hardware, whereas the relation between the tuning parameter and performance counters remains stable. As we show in this paper, the stability of the relation between tuning parameters and performance counters when changing hardware or input makes it possible to train and use models created from historical tuning data to speed up tuning space search with various benchmarks.
Figure 1: Dependence between a tuning parameter and various properties of the kernel, shown on Coulomb summation [13] running with a large gridbox on GeForce GTX 750 and with a small gridbox on GeForce GTX 1070. The x-axis shows a tuning parameter changing thread coarsening. The y-axis shows normalized values of selected properties: kernel runtime, L2 cache read transactions, texture cache read transactions and 32-bit floating-point operations.

The evaluation is performed on a benchmark set of five kernels taken from [25, 27]. The obtained results show that measuring the performance counters and using a model created from historical tuning data can bias the search process towards faster convergence. The proposed profile-based searcher systematically outperforms random search using a model created on a different GPU, a different input, or a smaller tuning space. The proposed searcher also outperforms Basin hopping implemented in Kernel Tuner [39] and the regression-tree model implemented in Starchart [18] in most cases.
The paper makes the following major contributions:
• Using hardware performance counters to navigate searching of tuning space. The proposed profile-based searcher is agnostic to the type of tuning parameters included in the tuning space (both their number and the code properties they tune). To the best of our knowledge, this is the first autotuning searcher using performance counters to speed up tuning space search convergence.
• Comprehensive evaluation of the proposed profile-based searcher. The proposed searcher is tested in multiple scenarios, including performance portability across various hardware, inputs, and tuning spaces, using five benchmarks and four GPU architectures. We have also compared our searcher to an optimization-based searcher [39] and a model-based searcher [18], showing that the proposed searcher converges faster to the near-optimal configuration in most cases.
• Integration of the searcher into a real-world tuning framework. The proposed profile-based searcher is implemented within Kernel Tuning Toolkit (KTT) [27]. Therefore, it is possible to measure the real speed of searching convergence. Moreover, the tuning framework with the searcher and experimental data [12] are freely available to the community (https://github.com/HiPerCoRe/KTT).
The rest of the paper is organized as follows. In Section 2, an example of manual tuning using performance counters is given. This example serves to illustrate the process which our framework intends to automate. The proposed search method is described in Section 3. Section 4 evaluates the proposed searcher and compares it to alternative approaches. The related work is described in Section 5. We conclude and outline future work in Section 6.
2. Example of Manual Tuning Space Search
In this section, we demonstrate a manual approach to tuning space search using hardware performance counters. This example aims to ease the understanding of the concepts behind our proposed automatic searcher described in the next section.
We use a simplified version of the 3D implementation of Direct Coulomb Summation introduced in [13] as an example. With Direct Coulomb Summation, the electrostatic potential around a molecule is computed on a regular grid. For each grid point V_i, we compute:

V_i = \sum_{j=0}^{n} \frac{w_j}{4 \pi \epsilon \, r_{ij}} \quad (1)

where n is the number of atoms, w_j is the charge of the j-th atom, r_ij is the Euclidean distance between atom j and grid point i, and ε is vacuum permittivity.
Listing 1 shows the source code of the simplified Direct Coulomb Summation kernel. The input of the kernel consists of atomInfo, containing atom coordinates and charges in a vector (x, y, z represent coordinates of the atom, w represents charge divided by 4πε), numberOfAtoms represents the number of atoms in atomInfo, gridSpacing is the size of a grid cell, gridSize is the number of grid cells in each dimension and energyGrid contains the output potential energy grid.
In our implementation, one GPU thread computes one or more grid points, determined by the value of Z_ITERATIONS, which is the only tuning parameter in this example. For each grid point, Equation 1 is computed. The code is expected to be compute-bound for two reasons: (i) for each atom's data loaded into the registers, we can compute values of multiple grid points, and (ii) all the threads within a warp load the same atom at the same time, so the code can benefit from good cache locality.

 1 __global__ void directCoulombSum(
 2     const float4 *atomInfo, int numberOfAtoms,
 3     float gridSpacing, int gridSize, float *energyGrid)
 4 {
 5     // integer coordinates within grid
 6     int x = blockIdx.x * blockDim.x + threadIdx.x;
 7     int y = blockIdx.y * blockDim.y + threadIdx.y;
 8     int z = (blockIdx.z * blockDim.z + threadIdx.z)
 9         * Z_ITERATIONS;
10
11     int slice = gridSize * gridSize;
12     int out = slice * z + gridSize * y + x;
13
14     // real coordinate within molecule space
15     float fX = gridSpacing * x;
16     float fY = gridSpacing * y;
17     float fZ = gridSpacing * z;
18
19     // zero the return values
20     float energyValue[Z_ITERATIONS];
21     for (int i = 0; i < Z_ITERATIONS; i++)
22         energyValue[i] = 0.0f;
23
24     // loop over all atoms
25     for (int i = 0; i < numberOfAtoms; i++)
26     {
27         float dX = fX - atomInfo[i].x;
28         float dY = fY - atomInfo[i].y;
29         float dZ = fZ - atomInfo[i].z;
30         float w = atomInfo[i].w;
31         // loop over multiple grid points
32         for (int j = 0; j < Z_ITERATIONS; j++) {
33             float rd = rsqrtf(dX*dX + dY*dY + dZ*dZ);
34             energyValue[j] += w * rd;
35             dZ += gridSpacing;
36         }
37     }
38
39     // store results in global memory
40     for (int i = 0; i < Z_ITERATIONS; i++)
41         if (z + i < gridSize)
42             energyGrid[out + slice * i] += energyValue[i];
43 }

Listing 1: Direct Coulomb Summation kernel
The code is autotuned by the KTT framework [27]. KTT passes the tuning parameter via the preprocessor macro Z_ITERATIONS. For the sake of simplicity, we do not optimize thread block size in this example. When tuning the code, we will assign values from the set {1, 2, 4, 8, 16, 32} to the tuning parameter.
Higher values of Z_ITERATIONS improve data locality in registers: once atom coordinates and charge are read (lines 27-30), they are used multiple times in the loop beginning at line 32. Moreover, this parameter reduces the invariant computation of dX·dX and dY·dY. However, a value of Z_ITERATIONS that is too high reduces strong scaling (the kernel is executed by a lower number of threads) and increases register consumption, which can lead to lower GPU occupancy or usage of slow local memory for register spilling. Knowing and understanding how tuning parameters relate to performance counters is part of a developer's expertise.
When profiling the code, the developer can set the value of the tuning parameter by observing hardware performance counters. In this example, several counters are highly relevant (some of them shown in Figure 1):
• utilization of floating point (FP) units: if high, Z_ITERATIONS should be set to a higher value (to reduce the number of FP operations by computing dX and dY fewer times);
• utilization of texture or L2 cache: if high (and not caused by register spilling), Z_ITERATIONS should be set to a higher value (to improve cache locality);
• GPU occupancy: if low, Z_ITERATIONS should be set to a lower value (to improve strong scaling);
• the amount of local memory: if high, Z_ITERATIONS should be set to a lower value in case utilization of any part of the memory subsystem is also high (to decrease bandwidth introduced by register spills).
Knowing and understanding how performance counters relate to bottlenecks is also part of a developer's expertise; a simplified code sketch of such rules is shown below.
In this simple example, we only search a one-dimensional tuning space. However, the rules mentioned here also apply with multiple tuning parameters: in such a case, the resulting kernel runtime and performance counter values are determined by a mixed effect of all tuning parameters, but Z_ITERATIONS still affects performance counters in the same way.
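The following Python fragment sketches how the rules above could be written down for this single parameter. It is a simplified illustration of the reasoning, not the expert system described in Section 3; the counter names and thresholds are hypothetical:

    def suggest_z_iterations_change(counters):
        """Return +1 to increase Z_ITERATIONS, -1 to decrease it, 0 to keep it.

        `counters` is a dict with hypothetical keys: 'fp_utilization' and
        'cache_utilization' on a 0-10 scale, 'occupancy' in <0,1>, and
        'local_memory_overhead' as a relative amount of local memory traffic.
        """
        if counters['occupancy'] < 0.25:
            return -1  # improve strong scaling by running more threads
        if counters['local_memory_overhead'] > 0.5 and counters['cache_utilization'] >= 8:
            return -1  # reduce register consumption causing spills
        if counters['fp_utilization'] >= 8 or counters['cache_utilization'] >= 8:
            return +1  # improve register and cache locality
        return 0  # no clear bottleneck; keep the current value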
Let us test our kernel using GeForce GTX 1070 with a grid of 256 × 256 × 256 points and 64 atoms. The kernel should be compute-bound; therefore, we expect to have high utilization of FP units when tuned. Let's set Z_ITERATIONS to 1. When profiled, the kernel runtime is 10,475 µs. Performance counters point to a clear bottleneck in the texture cache (utilization level 9 out of 10), whereas FP instruction units are not highly loaded (utilization level 3 out of 10). Therefore, we raise the value of Z_ITERATIONS to 32 to improve register locality. Now, the performance improves significantly: the kernel is executed in 1,531 µs. The performance counters show the following: texture cache utilization dropped from 9 to 3, FP utilization is 8, reported GPU occupancy is 0.6 (out of 1.0). We can try to increase occupancy to reduce pipeline and memory latencies. We set Z_ITERATIONS to 8 and get a slightly better runtime of 1,497 µs. The GPU occupancy is now 0.98, FP utilization is 9 and texture cache utilization is 9. The texture cache utilization can be lowered by setting Z_ITERATIONS to 16; however, this change does not lead to performance improvement; therefore, we can consider the kernel to be tuned.
Now, consider a change of the input data: using the same GPU, we will compute a grid of 25 × 25 × 25 points and 4096 atoms. The previously tuned implementation, setting Z_ITERATIONS to 8, yields a runtime of 394 µs. Occupancy of this implementation is 0.11. Some speedup can be reached by increasing occupancy, so we need to explore values lower than 8 for Z_ITERATIONS. When set to 4, the runtime is improved to 260 µs. Occupancy is improved to 0.17, and we can already see the bottleneck on the texture cache (utilization level 8), which suggests we should not decrease Z_ITERATIONS further. When we try to set Z_ITERATIONS to 2, the runtime increases to 428 µs. Although occupancy is improved to 0.32, the implementation is limited by the texture cache, which now reports utilization level 9. Therefore, we consider tuning to be finished, considering 4 to be the best value for Z_ITERATIONS.
In this section, we showed how tuning is performed rationally by an expert programmer, who understands the relation between tuning parameters and performance counters, as well as the relation between performance counters and bottlenecks. The programmer profiles the tuned code and analyzes the observed bottlenecks. Then, they determine the direction of change of the tuning parameter based on an expected effect on performance counters. In the next section, we describe how to automate the entire process.
3. Proposed search method
In this section, we describe the basic idea and assumptions of the proposed profile-based search method, and then we focus on the details of our current implementation.
3.1. Basic idea and assumptions
Our search method aims to mimic the process of the developer when optimizing the performance. Similarly to the developer, the method profiles the tuned code, analyses observed bottlenecks and decides how to change the tuning parameters to suppress them.
We capture the developer's expertise regarding how tuning parameters (TPs) relate to performance counters (PCs) in a model, and how performance counters relate to bottlenecks in an expert system. In our work, we consider the objective (performance in our case) to be a function of performance counters, and base our algorithm on that. We emphasize the trends, i. e., we do not rely on the current values of tuning parameters or performance counters – we focus more on how their change in a certain direction influences the direction and the magnitude of the change in the objective. That makes it possible for us to say which tuning configurations might improve the objective, even though we do not predict the exact value of that objective.
We recognize two categories of hardware performance counters (PCs): the counters measuring the stress of a processor component, called PC_stress, and the counters measuring the number of operations performed on the component, called PC_ops (let us take global memory as an example: here, the percentage of the peak bandwidth of global memory is in PC_stress, whereas the number of global memory transactions is in PC_ops).
We assume that:
1. it is possible to determine the computational bottlenecks from the values of performance counters (both PC_stress and PC_ops);
2. it is possible to determine which PC_ops need to be changed to suppress the bottlenecks;
3. the relation between TPs and PC_ops is sufficiently portable across different GPUs and problem inputs.
The first and second assumptions are based on knowledge of how performance counters work. If something seriously hinders the performance, performance counters should be able to reflect that in some manner (the first assumption). When the bottleneck is known, it is possible to determine which PC_ops reflect it and therefore, which type of operations should be reduced to improve the performance. We aim to mirror this process by creating an expert system for bottleneck analysis using available documentation and our expertise – details are in Section 3.4. We evaluate these two assumptions experimentally in Section 4.3.
The intuition behind the third assumption is as follows. The values of PC_ops depend on the GPU architecture, the problem input, and all tuning parameters. However, tuning parameters tend to change (i. e., decrease or increase) PC_ops in a consistent way, independently of the GPU architecture, problem input and other TPs. Considering our example in Section 2, the parameter Z_ITERATIONS decreases the number of GPU threads. When a larger grid is computed (i. e., input changes), the overall number of GPU threads would be greater, but the increase of Z_ITERATIONS would still lower the number of GPU threads.
The same holds for floating point operations, as shown in Figure 1.
The exception to the rule of well-portable relations between TPs and PC_ops are performance counters related to the cache and memory subsystem. Since different GPU architectures have different cache sizes, the threshold of cache capacity misses differs across GPU architectures. Therefore, a TP controlling cache blocking will start to increase cache bandwidth at a different value. However, this imprecision should only be observed within a small range of TP values, which causes the cache footprint to be close to the GPU cache capacity. In our example shown in Figure 1, the amount of texture cache transactions is similar on both benchmarks, whereas the L2 cache workload is changed by the tuning parameter more significantly when solving a larger gridbox on GTX 1070. We further evaluate the third assumption experimentally in Section 4.4 and Section 4.5.
Overall, if all of our assumptions are true, it is possible to mimic the process of a developer's performance optimization during autotuning using the following steps:
• start with a certain tuning configuration;
• measure PCs of the tuned kernel, e. g., the amount of local memory used;
• determine the bottlenecks using the expert system, e. g., the bandwidth on multiple levels of the memory hierarchy is high;
• determine the desired change in PCs (denoted as PC' from now on) with the expert system, e. g., lower the amount of local memory used;
• evaluate which tuning configurations change PCs in the desired way using the model, e. g., those which have lower Z_ITERATIONS than the profiled configuration;
• select the next tuning configuration.
We replace the selection of the next tuning configuration, usually done by the developer, with a step of random search biased by a score determining the likelihood of changing PCs in the required direction.
To sum up, we move from given values of TPs (defining a tuning configuration) through PCs (performance counters) to bottlenecks, and then from bottlenecks through PC' (changes to performance counters) back to a new set of TPs (a new tuning configuration).

3.2. Searcher workflow
The architecture of the whole tuning workflow is depicted in Figure 2. It consists of two phases: one for training beforehand and one for the actual search during autotuning.

Figure 2: Schematic view of the searcher workflow. The boxes show program components, cylinders show data objects.

During the training phase, two components are used. KTT is responsible for executing tuning and gathering PCs on a sample or a complete tuning space. Then, a model is built from the tuning data using any of the methods described in Section 3.3. Both components are problem- and GPU-independent. The raw tuning data are problem-, GPU- and input-dependent. The model is still problem-dependent (different problems can implement completely different TPs), but, according to the third hypothesis, it is independent of GPU model and input.
During the autotuning for the GPU and input of our interest, KTT is used to collect PC_stress and PC_ops. Then, the next component analyzes PC_stress to determine bottlenecks. It also uses PC_ops to distinguish the source of the bottleneck more precisely (e. g., if global memory bandwidth is the bottleneck, it uses the number of memory transactions to recognize how much of the bottleneck is
formed by memory reads or stores). The PC change computation component determines the required change of PC_ops. The bottleneck analysis and PC change computation components are GPU-dependent, because performance counters have been changed or extended several times as GPUs have evolved. Therefore, those components contain explicit support for different sets of performance counters. It is important to note that the bottleneck component analyzes the bottlenecks of the GPU architecture which is used for autotuning, whereas changes of PC_ops are computed for the GPU architecture the model was built on during the training phase. Therefore, we can use one model to steer autotuning on multiple architectures (an alternative implementation would be to translate the PC_ops used in the model for the GPU architecture used with autotuning). When the required changes of PC_ops are determined, the next component computes the scores of the tuning configurations. It uses the model to predict PC_ops according to TPs and sets higher scores to configurations which are predicted to change PC_ops in the required way. With the scores set, the searcher performs a step by selecting the next configuration for KTT to benchmark and profile.

3.3. Modeling relations between tuning parameters and performance counters
In this section, we describe the models for how TPs relate to PCs. These models are created during the training phase, i. e., before autotuning. During autotuning, we use them to evaluate which tuning configurations change PCs in the desired way (i. e., from PC' to TPs), as described in detail in Section 3.5.
We model the relation between TPs and PCs through non-linear regression models. As an input, we take the whole or a part of the tuning space that has been evaluated and profiled – for each tuning configuration, performance counters are measured. We split the tuning space into subspaces determined by the values of binary tuning parameters (e. g., in a tuning space with three binary parameters, we split into 2³ = 8 subspaces). For each subspace, we select datapoints for training in a deliberate way – for each non-binary parameter, we choose two or three values to keep the total number of value combinations relatively low, but to make the sampling of the subspace rather even despite constraints. The model for each profiling counter includes the main effects of each non-binary tuning parameter, their interactions (even those of a high order) and their influences of quadratic nature. We applied the least-squares regression method to compute the coefficients. Therefore, as an output, we end up with several models for each profiling counter, where the values of the binary tuning parameters determine the applicability of the model. The predictions themselves can be computed using the values of the non-binary tuning parameters.
For example, consider adding a parameter controlling the use of constant memory for atoms in the example described in Section 2. Such a parameter is binary (constant memory is either used or not used). Therefore, two models are created to capture relations between PCs and Z_ITERATIONS for using and not using constant memory.
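As an illustration, the following Python sketch (using scikit-learn) fits such a least-squares model with main effects, pairwise interactions and quadratic terms for a single performance counter within one subspace given by fixed binary parameters; it is a minimal example with placeholder data, not the authors' production code:

    import numpy as np
    from sklearn.linear_model import LinearRegression
    from sklearn.preprocessing import PolynomialFeatures

    # X: values of the non-binary tuning parameters of the sampled
    # configurations (one row per configuration); y: the measured value of
    # one PC_ops counter for each configuration. The numbers are placeholders.
    X = np.array([[1, 32], [2, 32], [4, 64], [8, 64], [16, 128], [32, 128]])
    y = np.array([9.6e6, 4.9e6, 2.6e6, 1.4e6, 8.1e5, 5.0e5])

    # degree=2 yields main effects, pairwise interactions and quadratic terms;
    # higher-order interactions would require a higher degree.
    features = PolynomialFeatures(degree=2, include_bias=False)
    model = LinearRegression().fit(features.fit_transform(X), y)

    # Predict the counter for an unseen configuration of this subspace.
    predicted = model.predict(features.transform(np.array([[8, 128]])))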
Decision trees build regression or classification models in the form of a tree structure. They break down a dataset into smaller and smaller subsets while at the same time an associated decision tree is incrementally developed. The final result is a tree with decision nodes and leaf nodes. A decision node has two or more branches, each representing values for the attribute tested (TPs in our case). A leaf node represents a decision on the numerical target (values of PCs). The topmost decision node in a tree corresponds to the best predictor, called the root node. Decision trees can handle both categorical and numerical data.
The core algorithm for building decision trees, called ID3 [29], employs a top-down, greedy search through the space of possible branches with no backtracking. The ID3 algorithm can be used to construct a decision tree for regression by replacing Information Gain with Standard Deviation Reduction. The decision of making strategic splits affects a tree's accuracy heavily. The decision criteria are different for classification and regression trees. Regression trees usually use mean squared error (MSE) to decide to split a node into two or more sub-nodes.
We generate a set of candidate trees. The trees use a randomly selected 50% of the explored tuning space for training and the rest for testing. We also alter parent nodes of the trees. We compute MAE (Mean Absolute Error) and RMSE (Root Mean Square Error) for those trees, and select the one with the lowest MAE – in case of a tie, we select the one with lower RMSE. The selected tree is then used to predict PCs for unknown configurations.
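A possible implementation of the candidate-tree generation and selection, again sketched with scikit-learn (the set of altered tree parameters is hypothetical):

    import numpy as np
    from sklearn.metrics import mean_absolute_error, mean_squared_error
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeRegressor

    def select_tree(X, y, candidate_depths=(4, 6, 8, None), seed=0):
        """Train candidate regression trees on random 50% splits and keep the
        one with the lowest MAE, breaking ties by RMSE."""
        best, best_mae, best_rmse = None, np.inf, np.inf
        for i, depth in enumerate(candidate_depths):
            X_train, X_test, y_train, y_test = train_test_split(
                X, y, train_size=0.5, random_state=seed + i)
            tree = DecisionTreeRegressor(max_depth=depth).fit(X_train, y_train)
            prediction = tree.predict(X_test)
            mae = mean_absolute_error(y_test, prediction)
            rmse = np.sqrt(mean_squared_error(y_test, prediction))
            if mae < best_mae or (mae == best_mae and rmse < best_rmse):
                best, best_mae, best_rmse = tree, mae, rmse
        return best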
3.4. Bottleneck analysis and reaction
In this section, we describe how we get from PCs to bottlenecks and from bottlenecks to PC'. This part is performed by expert systems which read performance counters, analytically determine bottlenecks, and create a vector of required changes of PC_ops.
The components described in this section perform two steps – the first component takes performance counters measured on an actually executed kernel and computes bottlenecks. The second component takes bottlenecks and computes changes in performance counters for the architecture of the model. Therefore, the components have to know the PCs of the currently executed kernel, the architecture of the GPU where the kernel was executed, and the architecture of the GPU the model was created for. Since the performance counters changed completely for the Volta generation and newer, the components are implemented for multiple sets of counters. The implementation is limited by our understanding of NVIDIA GPUs' performance counters, and we believe there is still room for improvement if the counters were documented in greater detail, or if the entire expert system was replaced by a machine learning model. We have developed the expert system using NVIDIA CUPTI documentation for CUDA 10.0. To improve our understanding of the counters' values, we have also used some benchmarks available in KTT. However, to avoid overfitting the expert system to our benchmarks, we have only used three benchmarks for its development (Coulomb sum, n-body and Matrix transposition), while the rest of the benchmarks have been used for evaluation of our model after it was developed.

3.4.1. Bottleneck analysis
The bottleneck analysis uses a set of performance counters described in Table 1. We use abbreviations of those counters in the following text and formulas. The table also shows the equivalency of old (i. e., on GPUs prior to Volta) and new (i. e., on Volta and newer) counters; applied adjustments or rescalings (e. g., the old version of counters measures rank in <0, 10>, whereas the new version measures percentage in <0, 100>); and classification of counters into PC_stress or PC_ops. Note that the equivalence of new and old performance counters is not completely documented, so we compute it according to our understanding of what the counters measure. Also note that we have assigned INST_ISSUE_U into PC_ops, because it quantifies the ratio of instruction cycles which cannot be used to issue instructions (for instance, due to synchronizations).
Bottlenecks are represented by a vector B = [b_{DRAM read}, b_{DRAM write}, ...], where ∀ b_x ∈ B: b_x ∈ <0, 1>. A zero value of the bottleneck means that the component is not stressed, while the value 1 represents the component being on the theoretical peak of its performance. Bottlenecks are divided into several categories, representing stress on various memory types, instructions and utilization of GPU parallelism.
The memory subsystem bottleneck analysis is computed in the same way for global memory, shared memory and the L2 cache. The utilization of the memory, reported by the performance counter DRAM_U, is scaled into <0, 1> and weighted by the ratio of read and write transactions. Equation 2 shows the computation of the bottleneck for global memory read, Equation 3 shows the computation of the bottleneck for global memory write. For shared memory and the L2 cache, the computation is analogous, using SHR_LT, SHR_WT, SHR_U and L2_RT, L2_WT, L2_U, respectively.

b_{DRAM\,read} = \frac{DRAM\_RT}{DRAM\_RT + DRAM\_WT} \cdot \frac{DRAM\_U}{10} \quad (2)

b_{DRAM\,write} = \frac{DRAM\_WT}{DRAM\_RT + DRAM\_WT} \cdot \frac{DRAM\_U}{10} \quad (3)
To compute the utilization of the texture (data) cache, we only scale the counter TEX_U to the interval <0, 1>, as the cache is read-only and therefore there are no write transactions.
The situation is more complicated with local memory. To the best of our knowledge, the LOC_O performance counter gives us a relative number of local memory data transfers. Therefore, even high local memory overhead does not imply the local memory is a real bottleneck if no memory subsystem is overloaded. Consequently, we weight the local memory overhead by the maximal utilization of the memories on the way: global, L2 and texture; see Equation 4.

b_{local} = LOC\_O \cdot \frac{\max(DRAM\_U, L2\_U, TEX\_U)}{10} \quad (4)
Special performance counters measure the utilization of instructions. However, we found those counters not to be very reliable, or we do not fully understand what they measure (for example, the matrix transposition example, which uses no floating point computations, reports high utilization of FP32 units in some configurations). Therefore, we derive the utilization from the overall amount of instructions. First, we compute the ins_fitted value using Equation 5, which computes the number of instructions (INST_EXE counts at the warp level; therefore, it has to be multiplied by 32), corrected by the efficiency of instruction execution (i. e., how many threads in the warp perform useful work).

ins_{fitted} = 32 \cdot INST\_EXE \cdot WARP\_E \cdot WARP\_NP\_E \quad (5)
Counter (prior Volta)              | Counter (Volta and newer)                                                          | Abbr.        | Type
dram_utilization                   | dram__throughput.avg.pct_of_peak_sustained_elapsed : 10                            | DRAM_U       | Stress
dram_read_transactions             | dram__sectors_read.sum                                                             | DRAM_RT      | Ops.
dram_write_transactions            | dram__sectors_write.sum                                                            | DRAM_WT      | Ops.
l2_utilization                     | lts__t_sectors.avg.pct_of_peak_sustained_elapsed                                   | L2_U         | Stress
l2_read_transactions               | lts__t_sectors_op_read.sum                                                         | L2_RT        | Ops.
l2_write_transactions              | lts__t_sectors_op_write.sum                                                        | L2_WT        | Ops.
tex_utilization                    | l1tex__t_requests_pipe_lsu_mem_global_op_ld.avg.pct_of_peak_sustained_active : 10  | TEX_U        | Stress
tex_cache_transactions             | l1tex__t_requests_pipe_lsu_mem_global_op_ld.sum                                    | TEX_RWT      | Ops.
local_memory_overhead              | l1tex__t_sectors_pipe_lsu_mem_local_op_st.sum                                      | LOC_O        | Ops.
shared_utilization                 | l1tex__data_pipe_lsu_wavefronts_mem_shared.avg.pct_of_peak_sustained_elapsed : 10  | SHR_U        | Stress
shared_load_transactions           | l1tex__data_pipe_lsu_wavefronts_mem_shared_op_ld.sum                               | SHR_LT       | Ops.
shared_store_transactions          | l1tex__data_pipe_lsu_wavefronts_mem_shared_op_st.sum                               | SHR_WT       | Ops.
sm_efficiency                      | smsp__cycles_active.avg.pct_of_peak_sustained_elapsed                              | SM_E         | Stress
inst_fp_32                         | smsp__sass_thread_inst_executed_op_fp32_pred_on.sum                                | INST_F32     | Ops.
inst_fp_64                         | smsp__sass_thread_inst_executed_op_fp64_pred_on.sum                                | INST_F64     | Ops.
inst_integer                       | smsp__sass_thread_inst_executed_op_integer_pred_on.sum                             | INST_INT     | Ops.
inst_misc                          | smsp__sass_thread_inst_executed_op_misc_pred_on.sum                                | INST_MISC    | Ops.
inst_compute_ld_st                 | smsp__sass_thread_inst_executed_op_memory_pred_on.sum                              | INST_LDST    | Ops.
inst_control                       | smsp__sass_thread_inst_executed_op_control_pred_on.sum                             | INST_CONT    | Ops.
inst_bit_convert                   | smsp__sass_thread_inst_executed_op_conversion_pred_on.sum                          | INST_BCONV   | Ops.
inst_executed                      | smsp__inst_executed.sum                                                            | INST_EXE     | Ops.
issue_slot_utilization             | smsp__issue_active.avg.pct_of_peak_sustained_active                                | INST_ISSUE_U | Ops.
warp_execution_efficiency          | smsp__thread_inst_executed_per_inst_executed.ratio · 100 : 32                      | WARP_E       | Stress
warp_nonpred_execution_efficiency  | smsp__thread_inst_executed_per_inst_executed.pct                                   | WARP_NP_E    | Stress

Table 1: List of performance counters and their abbreviations for GPUs. For counters implemented for the Volta generation and newer, the conversion ratio (if any) is written next to the counter.

Then, we compute the utilization of the issue slot. For GPUs prior to Volta, we use ins_util = INST_ISSUE_U. For Volta and newer, as those GPUs allow us to issue integer and floating-point instructions separately, we compute ins_util = min(1, INST_ISSUE_U). Therefore, the component considers full utilization of one instruction path as perfect utilization and cannot optimize towards better utilization of dual-issue in the current implementation. The utilization of FP32 units is computed using Equation 6. An analogous computation is performed for INST_F64, INST_INT, INST_MISC, INST_LDST, INST_CONT and INST_BCONV.

b_{fp32} = \frac{INST\_F32}{ins_{fitted}} \cdot ins_{util} \quad (6)

Finally, the bottleneck formed by insufficient instruction issue utilization is computed. First, we compute the maximal utilization across all types of instructions:

util_{max} = \max\left(\frac{INST\_F32}{ins_{fitted}}, \frac{INST\_F64}{ins_{fitted}}, \ldots\right) \quad (7)
Then, the bottleneck of instruction issue is computed using Equation 8.

b_{issue} = util_{max} \cdot \frac{100 - INST\_ISSUE\_U}{100} \quad (8)
Insufficient parallelism is computed as

b_{sm} = \frac{100 - SM\_E}{100} \quad (9)

An additional bottleneck of insufficient parallelism is computed from the number of CUDA threads and the number of CUDA cores, as shown in Equation 10. It is an empirical computation, requiring five threads per CUDA core to set the parallelism bottleneck to zero. However, we have found it useful to add this bottleneck alongside b_sm, as it captures the cases where all SMs are occupied, but the actual occupancy is low due to a small number of threads running per SM.

b_{paral} = \max\left(0, \frac{cores \cdot 5 - threads}{cores \cdot 5}\right) \quad (10)

3.4.2. Computing required changes of performance counters
When the bottleneck vector is computed, it is passed to the component responsible for computing required changes to PCs. The required changes are represented as the vector PC'_ops = [pc'_{DRAM RT}, pc'_{DRAM WT}, ...], where ∀ pc'_x ∈ PC'_ops: pc'_x ∈ <−1, 1>. The PC'_ops vector contains the required changes to the performance counters PC_ops. A negative value means that the counter value should be decreased, while a positive value means that the counter should be increased – zero means no change is required.
There is one additional parameter for the PC'_ops computation: inst_reaction. It determines the threshold of instruction-related bottleneck values which triggers the changes of PC_ops. The motivation behind the introduction of the inst_reaction parameter is as follows. The instructions have very low latency (compared to the memory subsystem). Therefore, they do not form a serious bottleneck unless there is high stress on the component performing the instructions. In case of no significant bottleneck for the kernel, the low bottlenecks related to instructions are ignored, and the memory is optimized, which should decrease latency and introduce a new bottleneck, either instruction- or memory-related. In our implementation, inst_reaction is set to 0.7. If the user indicates that an instruction-bound problem is being optimized, inst_reaction is set to 0.5, which slightly improves results for compute-bound problems as the reaction to an appearing instruction-related bottleneck is faster.
The changes of PCs related to the memory subsystems are computed straightforwardly: their value is set as the inverse value of the corresponding bottleneck, for example, pc'_{DRAM RT} = −b_{DRAM read}.
The instruction-related PCs are set only if the corresponding bottleneck value exceeds inst_reaction, as shown in Equation 11. An analogous computation is performed for all instruction-related bottlenecks including b_issue.

pc'_{INST\,F32} = \begin{cases} 0 & \text{if } b_{fp32} \le inst_{reaction} \\ -\dfrac{b_{fp32} - inst_{reaction}}{1 - inst_{reaction}} & \text{if } b_{fp32} > inst_{reaction} \end{cases} \quad (11)

The parallelism-related bottlenecks are applied straightforwardly, without inverting the bottleneck value: pc'_{SM E} = b_sm and pc'_{global} = b_paral. Note that pc'_global is not a true hardware performance counter, but represents the number of threads reported by KTT, which is added to the set of performance counters.
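To make the two steps concrete, the following Python fragment sketches the bottleneck analysis and the reaction for a few counters (a simplified illustration of Equations 2, 3, 8, 9 and 11; the counter names follow the abbreviations in Table 1, while the code structure is ours, not an excerpt of the KTT searcher):

    INST_REACTION = 0.7  # threshold triggering reactions to instruction bottlenecks

    def analyze_bottlenecks(pc, ins_fitted):
        """Compute a partial bottleneck vector from measured counters."""
        b = {}
        rw = pc["DRAM_RT"] + pc["DRAM_WT"]
        b["dram_read"] = pc["DRAM_RT"] / rw * pc["DRAM_U"] / 10   # Eq. 2
        b["dram_write"] = pc["DRAM_WT"] / rw * pc["DRAM_U"] / 10  # Eq. 3
        util_max = max(pc["INST_F32"], pc["INST_F64"], pc["INST_INT"]) / ins_fitted
        b["issue"] = util_max * (100 - pc["INST_ISSUE_U"]) / 100  # Eq. 8
        b["sm"] = (100 - pc["SM_E"]) / 100                        # Eq. 9
        return b

    def required_changes(b):
        """Turn bottlenecks into required changes of PC_ops in <-1,1>."""
        pc_change = {}
        # Memory-related counters: the inverse of the corresponding bottleneck.
        pc_change["DRAM_RT"] = -b["dram_read"]
        pc_change["DRAM_WT"] = -b["dram_write"]
        # Instruction-related counters: react only above the threshold (Eq. 11).
        if b["issue"] > INST_REACTION:
            pc_change["INST_ISSUE_U"] = -(b["issue"] - INST_REACTION) / (1 - INST_REACTION)
        # Parallelism: applied without inverting the bottleneck value.
        pc_change["SM_E"] = b["sm"]
        return pc_change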
3.5. Tuning configuration scoring
Let c_profile be the configuration which has been empirically profiled during autotuning. Let PC' be the required changes of PCs to soften the bottlenecks (see Section 3.4.2) of the configuration c_profile. To score an unexplored tuning configuration c_candidate, we need to estimate whether the configuration c_candidate changes PCs (compared to c_profile) in the direction defined by PC'_ops.
Since autotuning can be executed on a different GPU and different input, we cannot directly compare estimated PCs of c_candidate to measured PCs of c_profile. Instead, we use a model described in Section 3.3 to estimate the PCs of both configurations: let PC(c_profile) be the PCs of c_profile estimated by the model and PC(c_candidate) be the PCs of c_candidate estimated by the model. The score s of the configuration c_candidate is computed using Equation 12. Note that the measured and predicted values of some performance counters might be zero, so to avoid division by 0, we do not include those PCs in the score computation.

s = \sum_{p \in PC_{used}} pc'_p \cdot \frac{pc_p(c_{candidate}) - pc_p(c_{profile})}{pc_p(c_{profile}) + pc_p(c_{candidate})} \quad (12)

where PC_used is the set of profiling counters with non-zero predictions for the profiled and candidate configurations: PC_used = {∀ p ∈ PC' : pc_p(c_profile) ≠ 0 ∧ pc_p(c_candidate) ≠ 0}, and where pc_p means performance counter p. A configuration thus receives a positive contribution from counter p whenever its predicted change has the same sign as the required change pc'_p.
We compute all scores s_1 ... s_n for the configurations of our interest. Then, the score values are further processed. The lowest and highest scores are stored in s_min and s_max, respectively. The score for each configuration is normalized as follows:

s_{norm} = \begin{cases} 1 + s/s_{max} & \text{if } s > 0 \\ \max(0.0001,\; 1 - s/s_{min}) & \text{if } \gamma < s \le 0 \\ 0.0001 & \text{if } s \le \gamma \end{cases} \quad (13)

The normalized score is used to bias the probability of selecting that configuration. Equation 13 ensures a strong preference of high scores (positive scores are amplified by the normalization). However, it also allows for the selection of negative scores, albeit with very low probability. If the score before normalization is below the cutoff threshold γ (defined as −0.25 in our implementation), the normalized score is set to 0.0001. Negative scores greater than γ are scaled similarly to the positive scores. Setting a non-zero probability for negative scores is important in situations where all unexplored configurations seem worse than c_profile, which either indicates that the tuning has reached the optimum, or that it got stuck in a local optimum due to inaccuracy in the model or in the expert system decision.
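A minimal Python sketch of Equations 12 and 13 follows (illustrative only; pcs_profile and pcs_candidate stand for the model predictions PC(c_profile) and PC(c_candidate)):

    GAMMA = -0.25  # cutoff threshold for strongly negative scores

    def score(pc_change, pcs_profile, pcs_candidate):
        """Equation 12: reward candidates whose predicted counters move in the
        required direction; counters with zero predictions are skipped."""
        s = 0.0
        for p, change in pc_change.items():
            a, b = pcs_profile[p], pcs_candidate[p]
            if a != 0 and b != 0:
                s += change * (b - a) / (a + b)
        return s

    def normalize(scores):
        """Equation 13: map raw scores to selection weights, keeping a tiny
        non-zero probability even for seemingly bad configurations."""
        s_max, s_min = max(scores), min(scores)
        weights = []
        for s in scores:
            if s > 0:
                weights.append(1 + s / s_max)
            elif s > GAMMA:
                weights.append(max(0.0001, 1 - s / s_min) if s_min < 0 else 1.0)
            else:
                weights.append(0.0001)
        return weights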
Our current implementation uses weighted random search, where the random selection is biased towards configurations which should soften the bottlenecks found within the last profiled configuration. It is the most straightforward approach, which allows for direct comparison to (unweighted) random search. However, we believe that more sophisticated searching methods could be implemented in the future.
The complete workflow of the search is described by Algorithm 1. The algorithm takes several variables as input: TS is a tuning space (i. e., an array containing all tuning configurations), M is a model of the relation between TPs and PCs, n is the number of steps performed without collecting performance counters and i is the number of iterations for collecting performance counters (i. e., the number of empirical tests is i(n+1)). Note that CUDA kernels are executed faster when no performance counters are collected. Therefore, executing runs without collecting performance counters helps to find a fast kernel more quickly, although with the cost of performing more empirical search steps. The default value of n is set to 5 in our implementation.
Algorithm 1 uses the following components. In line 3, the empirical evaluation of the kernel c_profile is performed and its runtime and PCs are stored into t, PC_stress and PC_ops. Then, bottlenecks are determined and stored in b in line 4. After bottleneck detection, the searcher computes a reaction to the bottlenecks in line 5. The reaction is stored in the form of the required change of performance counters PC'_ops. In the current implementation, the algorithm scores all tuning configurations according to the required change of performance counters (in the case of huge tuning spaces, a restricted set of tuning configurations could be scored, e. g., only the neighbourhood of the currently explored configuration c_profile): it uses the model M to determine if a tuning configuration changes performance counters in a direction required by PC'_ops, see line 9. With the scored tuning space, the algorithm selects n kernels to benchmark without gathering performance counters. In the loop starting at line 16, the algorithm selects a configuration with a probability weighted by the score obtained in line 9. Then, the benchmarking of the selected configuration is performed. The fastest observed configuration is used for the next profiling run (performed in line 3).
ALGORITHM 1: Searching the tuning space with performance counters.
Input: TS, M, n, i
 1: c_profile ← random c ∈ TS
 2: for j = 0; j ≤ i; j++ do
 3:     t, PC_stress, PC_ops ← empirically measure runtime and PCs of c_profile
 4:     b ← bottlenecks obtained from PC_stress and PC_ops
 5:     PC'_ops ← changes in PCs required to weaken bottlenecks b
 6:     S ← new score array of size(TS)
 7:     for k = 0; k < size(TS); k++ do
 8:         if TS_k was not evaluated then
 9:             S_k ← compute score of TS_k, using M and PC'_ops
10:         end
11:         else
12:             S_k ← 0
13:         end
14:     end
15:     t ← ∞
16:     for k = 0; k ≤ n; k++ do
17:         r ← random number: 0 ≤ r < Σ_l S_l
18:         select l such that Σ_{i=0}^{l−1} S_i ≤ r < Σ_{i=0}^{l} S_i
19:         t' ← empirically measure runtime of TS_l
20:         if t' ≤ t then
21:             c_profile ← TS_l
22:             t ← t'
23:         end
24:         S_l ← 0
25:     end
26: end

3.6. Implementation
The main part of the proposed profile-based searcher is implemented in Python in its current version. As KTT is implemented in C++, there is a simple stub searcher code in KTT, which communicates with the main part of the searcher via files and sockets. The reason for the Python implementation is easier experimenting with the searcher code as well as a broad range of easy-to-use machine learning algorithms. Nevertheless, such an implementation also has a disadvantage in code performance. It is possible to rewrite the searcher code into C++ in the future.
The searcher consists of two parts: an application building the model and a plugin into KTT, which implements the proposed searcher method. The searcher is available in the profile-searcher directory in KTT (GIT tag v1.3-profile-searcher).

3.7. Future improvements
Although the proposed profile-based searcher is fully capable of searching the tuning space, biasing the search towards faster implementations, there are still many open research topics which can lead to the improvement of the searcher.
Currently, our implementation biases a simple random searcher. Although a random searcher is very robust, it is a global search method, which cannot speed up its convergence by following the gradient of the optimized function. Our searcher allows us to estimate which configurations should improve performance. This estimation could be used as an estimation of the gradient of the performance function, and thus a gradient-based searching method could be used. As shown in [39], local searching methods can be successful, especially in combination with a global searching method, which allows them to overcome local optima.
Currently, the relations of TPs to PCs are always modelled for the GPU where the model was constructed. Although this is sufficient to steer the tuning space search, it cannot precisely model the effect of cache capacity misses. We believe that it is possible to create a model which can correct cache-related PCs according to the detected caching capabilities of the GPU used for autotuning.
Our current expert system for bottleneck analysis and computation of the required changes of PCs works with a subset of the PCs which are available on GPUs. For example, PCs related to execution stall reasons are not used at all. Although it is not easy to construct an expert system which would interpret all PCs, it could be possible to replace the expert system with a machine learning model which could overcome our limited understanding of PC interpretation. To construct the model, we can use the multiple benchmarks and GPUs available. Moreover, each benchmark consists of a high number of tuning configurations; therefore, it is not too difficult to generate a high volume of training and testing data.
The runtime of kernel profiling is determined by the type and amount of gathered hardware performance counters. Together with improving the bottleneck analysis and reaction system, we could test if some counters are mostly redundant and reduce their amount, similarly to [1].

4. Evaluation
In this section, we evaluate our proposed profile-based searcher. First, we describe the methodology of the benchmarks and the testbed setup. Then, we evaluate how random step biasing influences the searcher performance when using a model created for the same setup, for a different GPU or for a different input. We also evaluate the convergence time of the proposed method. Finally, we compare our method with model-based tuning using Starchart [18] and with optimization-based tuning using Kernel Tuner [39].
4.1. Methodology
We have performed two types of evaluation:
• measuring the number of empirical tests of kernels (i. e., searcher steps);
• measuring tuning convergence in time.
An empirical test means that the kernel is compiled and executed in the selected tuning configuration. The number of empirical tests allows us to directly compare the efficiency of the proposed profile-based search method: how many empirical tests it needs to perform in order to converge to a certain efficiency of the kernel. On the other hand, a GPU kernel runs slower when being profiled, as it gathers PCs. Therefore, even reducing the number of empirical tests does not ensure faster tuning convergence of the proposed searcher. Therefore, we have also tested tuning convergence speed. In our evaluations, we often aim to find a well-performing configuration. We define a well-performing configuration as a configuration which creates a kernel with runtime within 1.–× of the best kernel runtime (found by exhaustive search).
Since the tuning space search is a stochastic process, it has to be repeated many times to get results which are free from random fluctuations. As it would be incredibly demanding to actually run and profile kernels during autotuning, we have created a program bypassing that – instead of running kernels many times, it performs an exhaustive exploration of the entire tuning space and saves the tuning results (kernel runtimes and PCs). Then we can perform the autotuning space search faster, i. e., simply load the kernel times and PCs from files and provide them to the searcher instead of actually running the kernel and profiling it (a sketch of this replay is shown below). This allows us to repeat tuning 1000× for any combination of benchmark, hardware used to create the model, and hardware used for autotuning. We only used this simulated autotuning to evaluate the number of searcher steps.
While we were evaluating tuning convergence time, we always performed empirical testing, i. e., we actually ran and profiled the kernels and measured the time. Therefore, we executed those tests 100× due to much higher time demands.
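For example, the replay can be sketched as follows (a hypothetical illustration of the bypassing program; the file format and names are ours):

    import csv

    # Load the exhaustively collected results once: one row per tuning
    # configuration with its measured runtime and performance counters.
    with open("tuning_space_gtx1070.csv") as f:  # hypothetical file name
        records = {row["configuration"]: row for row in csv.DictReader(f)}

    def simulated_measurement(configuration):
        """Stand-in for compiling, running and profiling the kernel."""
        row = records[configuration]
        runtime = float(row["runtime_us"])
        counters = {key: float(value) for key, value in row.items()
                    if key not in ("configuration", "runtime_us")}
        return runtime, counters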
4.2. Benchmarks and testbed
To test the searcher proposed in this paper, we used the benchmarks listed in Table 2. Those benchmarks have been introduced in [27], where more details about their implementation and optimizations are discussed. We have rewritten these benchmarks into CUDA to allow their profiling on NVIDIA GPUs. We removed the values of some tuning parameters as they are not supported in CUDA (large built-in vector variables, explicit cache prefetching), or are not implemented in the CUDA KTT backend (working with constant memory). However, our benchmarks are still able to reach near-peak performance on GPUs, similarly to [27] (the highest performance degradation compared to the richer OpenCL tuning spaces was 5% in the Coulomb sum benchmark running on GeForce GTX 680; at most 3% performance degradation was observed in the remaining combinations of benchmarks and GPUs). Therefore, we consider their tuning spaces sufficient for evaluation of the tuning space searcher. We have also implemented two versions of the tuning space for GEMM – the reduced space is taken from CLBlast [24] (denoted as GEMM) and the full space is taken from CLTune [25] (denoted as GEMM full). The smaller size of the GEMM space allows us to explore the entire tuning space in reasonable time, thus it is more practical for most of the experiments.

Benchmark     dimensions   configurations
Convolution       10            3,928
Coulomb 3D         7              210
GEMM              10            5,788
GEMM full         14          205,216
Transpose          8            1,784
N-body             7            3,134

Table 2: A list of the benchmarks and the size and dimensionality (i. e., the number of tuning parameters) of their tuning spaces.

Note that expert programmers have designed the tuning spaces to be reasonably small (without obviously slow configurations, e. g., resulting in sub-warp block sizes) [27]. Such search spaces should not include a vast amount of poorly-performing tuning configurations and therefore should not discriminate against simple search methods, such as random search.

Device             Architecture   Released
GeForce GTX 680    Kepler         2012
GeForce GTX 750    Maxwell        2014
GeForce GTX 1070   Pascal         2016
GeForce RTX 2080   Turing         2018

Table 3: GPU devices used in our benchmarks.
The benchmarks have been executed using four GPUs of different architectures, listed in Table 3. All benchmarks have been executed with Kernel Tuning Toolkit 1.3 equipped with the profile-based searcher (https://github.com/HiPerCoRe/KTT/tree/v1.3-profile-searcher).

4.3. Speedup allowed by the performance counters biasing
In the first experiment, we have measured the improvement given by biasing random search according to the required changes of PCs, computed by the bottleneck analysis and reaction subsystems. In other words, we experimentally evaluated the first two assumptions mentioned in Section 3.1. To do so, we needed to eliminate any effect of the model's imprecision. Therefore, we have performed an exhaustive exploration of the PCs of the benchmarks' complete tuning spaces and then, during autotuning, we read the previously measured PCs of tuning configurations instead of predicting them with the model.
               GTX 680   GTX 750   GTX 1070   RTX 2080
Coulomb sum       19        21        34         16
Matrix trans.    192        24        10         47
GEMM             146       248       450        260
n-body            27        10        37         39
Convolution      327       702       349        568

Table 4: Average number of empirical tests required for random search to find a well-performing configuration.

               GTX 680   GTX 750   GTX 1070   RTX 2080
Coulomb sum      3.–×        –         –          –
Matrix trans.    3.–×        –         –          –
GEMM             5.–×        –         –          –
n-body           1.–×        –         –          –
Convolution      8.–×        –         –          –

Table 5: Improvement (in terms of the number of empirical tests) of the proposed searcher over random search. The PCs from the same architecture are used.

The average number of empirical tests required by random search to find a well-performing configuration is shown in Table 4. The resulting improvement of the proposed searcher over random search is shown in Table 5. The table shows the improvement in terms of the average number of empirical search steps required to find a well-performing configuration. The average is obtained from 1,000 runs of the searcher.
As we can see in Table 5, the proposed searcher improves the number of empirical tests in all cases. The most significant improvement can be seen with the GEMM and Convolution benchmarks. Their tuning spaces are more complicated to search (see Table 4), so the improvement given by the biased random selection is more visible.
In this section, we evaluate the portability of the model: we created the model on specific hardware and input and used it to bias the tuning space search on different hardware, experimentally testing the third assumption given in Section 3.1. In this case, we could not read the PCs from a different GPU or input directly, as the executable configurations generally differ across GPUs and inputs; therefore, we had to use the predictive model. The model was created using the decision trees described in Section 3.3. Similarly to the experiment above, we measured the improvement over random search in terms of the number of iterations (i.e., empirical tests) required to reach a well-performing configuration. The results were averaged over 1,000 executions.
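A minimal sketch of such a model is shown below, assuming the scikit-learn library; X would hold the tuning-parameter values of the configurations profiled on the source GPU, and Y the performance counters measured for them. A single multi-output tree is fitted here for brevity; whether one tree is fitted per counter, and with which hyperparameters, are choices on which this sketch takes no position.

from sklearn.tree import DecisionTreeRegressor

def build_pc_model(X, Y):
    """Fit a decision tree mapping tuning parameters to performance
    counters (TPs -> PCs), usable on unseen GPUs or inputs."""
    model = DecisionTreeRegressor()
    model.fit(X, Y)
    return model

# During tuning on different hardware or input, predicted counters
# replace the exact readings of the previous experiment:
# predicted_pcs = build_pc_model(X, Y).predict(candidate_tps)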
Table 6: Improvement (in terms of the number of empirical tests) of the proposed searcher over random search. Rows show the GPUs used for benchmark execution, columns show the GPUs used to create the model of TP and PC relations. Panels: Coulomb sum, Matrix transposition, GEMM, n-body and Convolution benchmarks.
The performance portability results of the Coulomb sum, Matrix transposition, GEMM, n-body and Convolution benchmarks are given in Table 6. The table shows the speedup of the proposed searcher for all combinations of a GPU used to build the model and a GPU used for autotuning.

It can be seen that the model portability is good in most of the examples. In some cases, using a model built on a different GPU is even better than using the model created on the same GPU that was used for autotuning (e.g., the GEMM benchmark executed on the RTX 2080 converges faster with the model created from GTX 1070 data). However, we consider this to be a somewhat random artefact, where inaccuracies in the model compensate for inaccuracies in the expert system biasing the search.

Note that the decision tree predictions have been used instead of the exact reading of PCs even in cases when the same GPU was used for both model building and autotuning. Therefore, the values on the diagonals of the tables within Table 6 do not copy the values from Table 5. For example, the GEMM benchmark executed on the RTX 2080 reaches a speedup of more than 10× with the exact PCs from the RTX 2080, whereas the decision tree model from the same GPU limits its speedup to roughly 9×.
Table 7: Improvement (in terms of the number of empirical tests) of the proposed searcher over random search with the GEMM benchmark. Rows show the sizes used for benchmark execution, columns show the sizes used to create the model of TP and PC relations.
We further investigated the performance portability for the case of varying input. In our example, we used the GEMM benchmark with significantly varying input matrices:

• multiplication of square matrices of size 2048 × 2048;
• multiplication of square matrices of size 128 × 128;
• multiplication of highly rectangular matrices (a matrix of size 4096 × 16 multiplied by a matrix of size 16 × 4096).
The results are summarized in Table 7. The proposed searcher improves over random search for all combinations of input sizes, even if using a model trained on the memory-bound GEMM instance. The improvement with small or highly rectangular matrices is lower because searching their tuning spaces requires a lower number of searcher steps.

In this section, we compare the convergence of the proposed profile-based searcher with the convergence of random search in time. Our motivation for this comparison is that random search can converge faster in time despite requiring more empirical tests, because it has two advantages over the profile-based searcher: it does not collect performance counters (collecting performance counters slows down kernel execution), and it has minimal computational overhead (no model is evaluated). On the other hand, the proposed searcher navigates the search towards faster implementations and therefore omits the evaluation of very slow configurations. Whereas the collection of performance counters is costly especially with long-running kernels (the kernel runtime dominates the time of kernel compilation, data copying and selection of a new configuration to test), the overhead of the searcher is more significant with vast tuning spaces.
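This trade-off can be summarized by an illustrative cost model (our notation, introduced only for this discussion): with $n$ empirical tests, the total tuning time is roughly

\[
T_{\mathrm{tuning}} \approx \sum_{i=1}^{n}\left(t^{(i)}_{\mathrm{compile}} + r_i\, t^{(i)}_{\mathrm{kernel}} + t^{(i)}_{\mathrm{overhead}}\right),
\]

where $r_i$ is the number of kernel executions per test (profiling replays push $r_i$ above one for the profile-based searcher) and $t^{(i)}_{\mathrm{overhead}}$ covers model evaluation and the selection of the next configuration. The profile-based searcher pays more per test through $r_i$ and $t^{(i)}_{\mathrm{overhead}}$, and wins only when it reduces $n$ enough to compensate.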
To test the convergence time of the searchers, we used a machine equipped with an NVIDIA GeForce RTX 2080, an Intel Core i7-8700, 32 GB RAM, Ubuntu 18.04.3, NVIDIA driver 418.39 and CUDA 10.1. Each benchmark has been executed a hundred times. The model has been created using data obtained with a GeForce GTX 1070. Therefore, our test simulates a situation when the user acquires a brand new GPU and has autotuning data from an older GPU.

We have performed autotuning with data persistent on the GPU and no comparison to the reference computation. Such a setting is typical for dynamic autotuning. It improves the results of the random searcher, as there is no overhead of data movement and comparison, which could hide the slower execution of kernels while gathering performance counters. We have, on the other hand, selected the size of the kernels' inputs such that the GPU is fully occupied but the kernels do not run longer than necessary (we consider 1-10 milliseconds a suitable kernel runtime for our experiments); this choice improves the results of the proposed searcher, because profiled kernels do not run for too long. In two experiments, we switch on the comparison to the reference kernel and use a larger kernel input to demonstrate the described effects.

In this evaluation, we show the searcher's convergence as a graph of the average kernel runtime at each second of autotuning. When a long-running kernel is selected in the first iteration of the search, the benchmark has no finished kernel for the first several seconds of tuning (this is especially the case when gathering performance counters). We decided to draw the graph from the time when all 100 experiments have at least one finished kernel, and therefore its time is known. Such a solution prevents the results of the proposed searcher from being artificially improved, e.g., by only plotting the runtime of the finished (and thus fast) kernels during the first seconds of measurements.
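The following sketch shows one way to compute such curves, under an assumed bookkeeping format: for every run, a list of (finish time in seconds, kernel time) pairs in finishing order.

import numpy as np

def convergence_curve(runs, horizon):
    """Average best-so-far kernel time at each second of tuning,
    starting once every run has at least one finished kernel."""
    start = max(r[0][0] for r in runs)   # latest first-kernel finish
    seconds = np.arange(int(np.ceil(start)), horizon + 1)
    curve = [np.mean([min(k for t, k in r if t <= s) for r in runs])
             for s in seconds]
    return seconds, np.array(curve)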
Figure 3: Convergence of GEMM, 2048 × 2048.
Figure 4: Convergence of Convolution, 4096 × 4096.
Figure 5: Convergence of Matrix transposition, 8192 × 8192.
Figure 6: Convergence of n-body on RTX 2080, model from GTX 1070. Left: tuning with 16,384 bodies. Right: tuning with 131,072 bodies. The solid line shows the average, the transparent area shows the standard deviation.

The GEMM and Convolution benchmarks are the most difficult to autotune (see Table 4). Therefore, improving the speed of convergence is most important for those benchmarks. The GEMM benchmark was run using square matrices of size 2048 × 2048; the convergence of GEMM and Convolution is shown in Figures 3 and 4. With 16,384 bodies, the n-body benchmark converges significantly faster with the proposed searcher – see Figure 6. We also tested a situation with tuning executed on a much bigger problem instance of 131,072 bodies (note that n-body is in O(n²), where n is the number of bodies). In such a setup, kernels run much longer, and the profiling overhead included in the proposed searcher makes it slower than the random searcher (see Figure 6, right).
The Coulomb sum benchmark uses a grid of 256 × 256 × 256 cells and 256 atoms. It converges very quickly, even with the random searcher. Figure 7 shows that the proposed searcher needs some time for the initial profiling; then it converges very quickly, whereas the random searcher matches its performance after 20 seconds of tuning. However, both searchers converge close to the optimum in less than 5 seconds, so the choice of the searcher is not important here.

The setup of this experiment also allows us to benchmark the GEMM full benchmark. As we have not performed an exhaustive search of GEMM full, the model was created from the tuning space of the GEMM benchmark, measured on a GeForce GTX 1070. Note that the GEMM benchmark does not cover the tuning space of the GEMM full benchmark completely – it lacks four tuning parameters and uses less than 3% of the GEMM full configurations. However, Figure 8 shows that the proposed searcher converges much faster than the random searcher with the GEMM full benchmark: the proposed searcher requires 61 seconds to match the results that the random searcher reaches after 300 seconds. This result could be further improved by optimizing the implementation of the proposed searcher: with the GEMM full tuning space, it requires significant time to score tuning configurations, resulting in a 3× longer time per empirical test.

Figure 7: Convergence of Coulomb sum, grid 256 × 256 × 256.
Figure 8: Convergence of GEMM full, 2048 × 2048.

So far, we have compared our proposed profile-based searcher to random search. However, the Basin Hopping optimization has recently been shown to perform better than many other optimization methods [39]. Therefore, we also compare our searcher (with the settings described in Section 4.6) to the Basin Hopping searcher implemented in Kernel Tuner [39].
We used the same machine as for KTT (see Section 4.6) to test Kernel Tuner. We re-implemented the KTT examples for Kernel Tuner and autotuned them with Basin Hopping optimization in its default settings. We are aware that the parameters of the Basin Hopping method can be tuned towards improving search convergence speed; however, the optimal values of those parameters are not known a priori, so they cannot be tuned before autotuning for new hardware or input is finished. Therefore, we kept the default settings for a fair comparison.

Figure 9: Convergence of the Coulomb sum benchmark using KTT and Kernel Tuner. Left: convergence speed in time. Right: comparison of iterations (empirical tests). The solid line shows the average, the transparent area shows the standard deviation.
During the testing of Kernel Tuner, we observed that it works more slowly than KTT, even when the random searcher is used. We suppose the source of its slower speed is twofold. First, Kernel Tuner is implemented in Python, whereas KTT is a native C++ application. Second, Kernel Tuner executes each kernel three times to get better timing precision. Although KTT executes each kernel only once, it gives more consistent results: when executed multiple times, the KTT timing is more stable than the performance times reported by Kernel Tuner. We suppose this is caused by the overhead of the Python-based Kernel Tuner implementation. We also found that Kernel Tuner systematically reports slightly higher kernel runtimes than KTT. Therefore, we normalized the performance times to make them comparable: we performed an exhaustive search of the tuning space and multiplied the times reported by Kernel Tuner by the ratio of the best time measured by KTT to the best time measured by Kernel Tuner. Consequently, both tuning systems converge to the same kernel times in our figures.

As the comparison of timing is affected by the slower execution of Kernel Tuner, we also compare the number of searcher steps. The number of steps allows us to deduce how well Basin Hopping navigates the tuning space compared to random search and the proposed search based on performance counters, without the handicap of the slower Kernel Tuner execution.
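In other words, if best_ktt and best_kt denote the best times found by the exhaustive search in KTT and Kernel Tuner respectively, every Kernel Tuner measurement t is plotted as t * best_ktt / best_kt; a one-line sketch:

def normalize_kt_time(t, best_ktt, best_kt):
    # scale Kernel Tuner's reported runtime so both tools agree on
    # the best kernel time of the exhaustively searched space
    return t * best_ktt / best_kt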
Figure 9 shows a comparison of the proposed searcher implemented in KTT and Basin Hopping implemented in Kernel Tuner on the Coulomb sum benchmark. Comparing convergence times, the proposed searcher converges faster during the first 10 seconds; after 10 seconds, it is slightly outperformed by Basin Hopping. When we compare the number of empirical tests, the proposed searcher behaves similarly to Basin Hopping.

The GEMM benchmark is compared in Figure 10. It can be seen that the Basin Hopping optimization implemented in Kernel Tuner converges more slowly than the random searcher during the first 70 seconds and then outperforms random search slightly. However, this result of Basin Hopping is related to the slower execution of Kernel Tuner. When we compare the number of empirical tests, Basin Hopping converges to the optimum much faster than the random searcher after the first 200 iterations. The proposed profile-based searcher uses significantly fewer empirical tests than both random search and Basin Hopping, and outperforms them in both the number of empirical tests and the convergence time.

Figure 10: Convergence of the GEMM benchmark using KTT and Kernel Tuner. Left: convergence speed in time. Right: comparison of iterations (empirical tests). The solid line shows the average, the transparent area shows the standard deviation.

Figure 11: Convergence of the Matrix transposition benchmark using KTT and Kernel Tuner. Left: convergence speed in time. Right: comparison of iterations (empirical tests). The solid line shows the average, the transparent area shows the standard deviation.

With the Matrix transposition benchmark, the slowdown of Kernel Tuner is more visible, see Figure 11. Kernel Tuner requires about 16 seconds to start tuning, probably because of the complicated constraints pruning the tuning space. After initialization, it converges quickly, but cannot outperform the random or the proposed searcher. When we compare the number of empirical tests, Basin Hopping again converges near the optimum much faster than the random searcher; however, the proposed profile-based searcher uses significantly fewer empirical tests.

With the n-body benchmark, the Basin Hopping method does not converge well, see Figure 12. Comparing both the convergence time and the number of empirical tests, it performs worse than the random searcher and the proposed searcher. Similarly to n-body, the Basin Hopping method converges slowly in the Convolution benchmark, see Figure 13. Comparing the number of empirical tests, it works similarly to the random searcher; however, it converges much more slowly when real-time tuning is measured.
Figure 12: Convergence of the n-body benchmark using KTT and Kernel Tuner. Left: convergence speed in time. Right: comparison of iterations (empirical tests). The solid line shows the average, the transparent area shows the standard deviation.

Figure 13: Convergence of the Convolution benchmark using KTT and Kernel Tuner. Left: convergence speed in time. Right: comparison of iterations (empirical tests). The solid line shows the average, the transparent area shows the standard deviation.
Both the Matrix transposition and Convolution benchmarks reduce the tuning space significantly using pre-defined constraints. When the tuning space (created as the cross product of pre-defined tuning parameter values) is pruned by constraints, only 1.55% of configurations remain for Matrix transposition, and 0.025% remain for Convolution. We speculate that this may cause a significant delay at the start of tuning (16 seconds with Matrix transposition and 45 seconds with Convolution). In contrast, 41.7% of configurations remain after constraint application with Coulomb sum, 33.3% with n-body and 22.1% with GEMM. Although this issue could be solved by optimizing the Kernel Tuner code, the comparison of empirical tests still shows that Basin Hopping needs more empirical tests than the proposed searcher.
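For illustration, the fraction of surviving configurations can be computed by enumerating the cross product and applying the constraints, as sketched below (the parameter and constraint shapes are our assumptions); with spaces where well under 1% survive, this enumeration itself is non-trivial work, which may be what delays the start of tuning.

from itertools import product

def surviving_fraction(params, constraints):
    """params: dict mapping a tuning-parameter name to its values;
    constraints: predicates over a configuration dictionary."""
    names = list(params)
    total = kept = 0
    for values in product(*(params[n] for n in names)):
        cfg = dict(zip(names, values))
        total += 1
        if all(c(cfg) for c in constraints):
            kept += 1
    return kept / total

# e.g. surviving_fraction({"TILE": [8, 16, 32], "VEC": [1, 2, 4]},
#                         [lambda c: c["TILE"] * c["VEC"] <= 64])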
To compare our approach to regression trees, we employed Starchart [18], which trains regression trees to predict the outcome variable (usually performance or consumed power) based on the values of tuning parameters. With the trained tree, one can predict the outcome for the entire space and find the configurations with the best predicted outcomes.
In their evaluation, the Starchart authors started the training with 20 datapoints and added more iteratively until the prediction accuracy, measured as the median relative prediction error, got below 15%, or a maximum of 200 training datapoints was reached. The prediction accuracy was evaluated on 200 validation datapoints. Both training and validation datapoints were selected by uniform random sampling. It is worth noting that the tuning spaces used to evaluate Starchart in [18] were often very large, as they contained entire ranges of values instead of only a few meaningful ones, e.g., block sizes 32-1024 instead of just 32, 64, 128, 256, 512 and 1024.

We evaluate the accuracy of the approach as follows:

• we randomly select 200 tuning configurations as a testing set;
• we perform training by measuring additional tuning configurations until the median relative prediction error gets below 15%;
• we measure how many configurations, sorted by the best predicted kernel time, need to be examined to find one that is actually good.

In the last step of the protocol, we take the whole tuning space, calculate the predictions and sort them from the best to the worst. In this order, we examine the configurations one by one and check whether each is actually well-performing, i.e., within 110% of the best kernel time (the whole protocol is sketched below).

We have evaluated five benchmarks (coulomb, gemm-reduced, conv, mtran, nbody) using a GeForce RTX 2080. Because Starchart is not a tuning toolkit, we cannot directly compare tuning speed in time. Therefore, we compare the number of empirical tests here.
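A self-contained sketch of this protocol follows, with array lookups standing in for real measurements; the training batch sizes are illustrative, not taken from Starchart.

import random
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def starchart_style_search(X, times, tolerance=0.15):
    """X: tuning-parameter values of all configurations (one row per
    configuration); times: their measured kernel times (numpy array)."""
    idx = list(range(len(times)))
    random.shuffle(idx)
    test, pool = idx[:200], idx[200:]
    trained, model = [], DecisionTreeRegressor()
    while pool:
        batch = 20 if not trained else 10
        for _ in range(min(batch, len(pool))):
            trained.append(pool.pop())
        model.fit(X[trained], times[trained])
        rel_err = np.abs(model.predict(X[test]) - times[test]) / times[test]
        if np.median(rel_err) < tolerance or len(trained) >= 200:
            break
    # rank the whole space by predicted time; count how many
    # configurations must be examined to find one within 110% of best
    order = np.argsort(model.predict(X))
    good = times.min() * 1.10
    examined = next(i + 1 for i, c in enumerate(order) if times[c] <= good)
    return len(trained), examined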
Table 8: Results of autotuning using Starchart compared to the random searcher, on GeForce GTX 1070 (top) and GeForce RTX 2080 (bottom). All columns show the number of empirical tuning steps. The model build column shows the steps required for model training and testing, tuning shows the steps required before a well-performing configuration is found, and random shows the number of steps required by the random searcher (taken from Table 4).

GeForce GTX 1070   model build   tuning   random
Coulomb sum            233          26       33
Matrix trans.          378          14        9
GEMM                   400         636      449
n-body                 397          21       36
Convolution            321         278      348

GeForce RTX 2080   model build   tuning   random
Coulomb sum            234           7       15
Matrix trans.          397          50       46
GEMM                   391         413      259
n-body                 369           9       38
Convolution            259           9      567
We performed the comparison of Starchart and the random searcher in Table 8, for both GeForce GTX 1070 and GeForce RTX 2080. The table shows the number of tuning steps required to build a model and to perform a search until a well-performing configuration is found. The values for the random searcher are taken from Table 4. As can be seen, when the model build phase is included, Starchart performs worse than the random searcher, except for the Convolution benchmark on GeForce RTX 2080. We believe that a well-designed tuning space, which does not contain a vast amount of poorly-performing tuning configurations, cannot be easily tuned with surrogate models created as part of the tuning process itself.
                SC@1070   proposed@1070
Coulomb sum        3            5
Matrix trans.     42           18
GEMM             564           16
n-body            17            6
Convolution        9           26

Table 9: Results of autotuning using Starchart compared to the proposed searcher, using GeForce RTX 2080 and models trained on GeForce GTX 1070. Both columns show the number of empirical tuning steps.
Although Starchart is not designed to ensure model portability (i.e., using a model trained on a different GPU), we performed a test in which a regression tree from a different GPU is used. The idea behind this experiment is that if performance portability holds for some of the well-performing configurations (i.e., there are configurations which perform well on both GPUs), the regression tree trained on a different GPU can be used successfully. We built the regression tree on a GeForce GTX 1070 and used it to navigate the search on a GeForce RTX 2080. We compared the Starchart results to our searcher using a model from the GeForce GTX 1070 to navigate tuning on the GeForce RTX 2080. The comparison is given in Table 9. In this scenario, Starchart can compete with the proposed searcher when the regression tree can describe the tuning space well and some of the well-performing configurations of the GPU used for the model build are also well-performing on the GPU used for autotuning. However, our model does not have such a restriction. The proposed searcher also seems to be more robust: there are no long tuning times such as those observed with Starchart on the GEMM example (on both GeForce GTX 1070 and RTX 2080) and Convolution (on GeForce GTX 1070).
5. Related Work
Tuning space search methods can be divided into three categories: (i) model-free methods, which view the tuning space search as a mathematical optimization problem and empirically search for the best configuration; (ii) model-based methods, which use a performance or power model to select the best configuration from the tuning space directly; (iii) hybrid methods, which combine empirical search with a model.

In this section, we compare our approach with the state-of-the-art methods in tuning space search, and also with methods using historical profiling data and performance counters for code optimization.
The model-free methods based on optimization are used in the majority of code-optimization-parameter autotuning frameworks [2, 14, 25, 31, 39]. As tuning spaces are not continuous, contain boolean tuning variables, and are non-linear and non-convex, searching them is challenging. Therefore, in some experiments, random search seems more reliable than more sophisticated search methods [33, 25]. On the other hand, some authors show improvement over random search. In [6], the Nelder-Mead method has been successfully adapted to non-continuous tuning spaces. More recently, it has been shown that the combination of global and local search has the potential to systematically outperform random search [39]. The difference between model-free optimization-based searching and our method is that the optimization has to be executed from scratch when hardware or a performance-relevant input parameter changes, whereas our method uses a model portable across different GPUs and inputs. In this paper, we compare the results of our searcher to [39]. We confirmed the result from [39] showing that Basin Hopping coupled with local search is superior to random search. We also show that our searcher outperforms Basin Hopping in both the number of empirical tests and the convergence time.
The model-based methods need a model which can be evaluated faster than empirical testing. Therefore, simulators such as GPGPU-Sim [4] are not practical for this purpose. Multiple models for the analysis and performance prediction of CUDA or OpenCL code have been developed [16, 3, 42, 34, 17, 21]. However, the construction of those models often requires manual effort, which makes them impractical for autotuning. Moreover, independent studies reveal high inaccuracies in their ability to predict performance on some problems [37, 22], so it is questionable whether they can be used to select good tuning configurations from a vast tuning space. On the other hand, model-based tools seem to have promising results if only one tuning parameter is searched. For example, it has been shown that a model can replace empirical tuning when selecting a well-performing work-group size (thread block size in CUDA terminology) [9, 8, 41], thread-coarsening factor [9] or code variant [23]. In our work, we focus on the optimization of many tuning parameters together.
Hybrid methods introduce a surrogate model which is not intended to find the best tuning configuration directly, but rather to prune or bias the empirical search. In [10], a machine learning model is built from a sample of the tuning space, and only the tuning configurations with the best predicted performance are empirically tested. The authors show that their method can outperform random search. Regression trees have been used to speed up autotuning in multiple studies [11, 18, 28]. The regression trees are built from a representative sample of the tuning space, and their precision can be improved during the search [11]. All those papers evaluate their approach using rather vast tuning spaces, e.g., testing all integer thread block sizes in an interval instead of using only the more rational power-of-two sizes 2^n. Therefore, the regression trees can be built from a tiny fraction of the tuning space. With rationally constructed tuning spaces, such as those presented in [25, 27], a large portion of the tuning space has to be used for tree construction, as we show experimentally in Section 4.8. As the models have to be re-trained when hardware or performance-relevant input characteristics change, they are not practical when a large portion of the tuning space is needed for training. In contrast, our approach allows the use of a model trained on different hardware or input.

Only a few works use historical autotuning data (e.g., obtained on different hardware) to improve autotuning convergence speed. In [23], the authors predict the best code variant of a kernel for an unseen GPU by using data obtained on a set of explored GPUs. The difference in our work is that we focus on searching a multi-dimensional tuning space, and only one GPU is sufficient for the empirical exploration with our model. Historical data observed on one CPU architecture are used to bias a multi-dimensional search on a different CPU architecture in [32]. Their approach works well if the fast implementations on both architectures are correlated; our approach does not require such a correlation. A model which automatically selects a tuning configuration for a specific application input is presented in [36]. It is trained on different inputs and then generalizes how to select the right tuning configuration for the used input. Compared to our model, it does not need to repeat the empirical search when the input changes; on the other hand, our model does not need to be trained on multiple inputs and GPUs. In our previous work [26], we used data measured on one hardware to prune tuning space dimensions on different hardware. While this approach works well for speeding up exhaustive search, it brings no advantage when coupled with a searcher based on mathematical optimization.
The typical usage of hardware performance counters is to profile a kernel for manual inspection of bottlenecks. However, there are several works which use performance counters to navigate optimization automatically. In [23], the authors use performance counters to detect relevant features for their code variant selection model. In [19, 40], performance counters are used to approximate the performance and scaling of a known implementation on unseen GPUs. Our work focuses on searching the tuning space of potentially unseen implementations. CPU performance counters have been used to select a well-performing combination of the number of OpenMP threads and the scheduling strategy [38], well-performing compiler optimization switches [7], and a well-performing combination of activated or deactivated prefetchers [30]. In contrast, our work focuses on the optimization of arbitrary user-defined tuning parameters.
6. Conclusion and future work
In this paper, we have introduced a novel tuning space searcher which uses hardware performance counters to speed up the searching process. The proposed profile-based searcher builds a model of the relations between tuning parameters and performance counters using any GPU and input, and uses this model to navigate the search on an unseen GPU or input. We have experimentally shown that the performance counters can significantly reduce the number of empirical tests that have to be performed during autotuning. We have also shown that the real-time convergence of the proposed searcher is typically superior to the random searcher and to the state-of-the-art optimization-based Basin Hopping searcher implemented in Kernel Tuner [39]. The proposed searcher is implemented in Kernel Tuning Toolkit [27] and can therefore be used for both offline and dynamic autotuning.

Although the searcher was designed for CUDA-enabled GPUs, we believe that a similar searcher can be developed for GPUs of other vendors or even for different processor architectures, such as CPUs.

In future work, we plan to improve the searcher performance. As discussed in Section 3.8, some components of the proposed profile-based searcher can be replaced by a more sophisticated implementation. We plan to experiment with (potentially gradient-based) local search methods, to predict performance counters for the hardware used for autotuning, and to replace the system for bottleneck analysis and reaction with a machine-learning model.

Apart from improving the proposed searcher, we plan to investigate other possibilities of leveraging hardware performance counters. First, we plan to extend our system to predict how well-tuned the actual configuration of a kernel is. Such a prediction would allow us to stop the tuning at the right time, as well as to predict how much performance can be gained by autotuning (which is especially important during dynamic autotuning, when the tuning time is limited). Second, we plan to use the vast amount of tuning data for studying the behavior and efficiency of different hardware architectures and code optimization strategies.

The source code of Kernel Tuning Toolkit, the proposed profile-based searcher, and the profiling data collected for this study are available to the community. Therefore, it is easy to replicate the results of this study, test the searcher with different benchmarks, or modify and extend our searcher implementation.
Acknowledgements
The work was supported by the European Regional Development Fund project "CERIT Scientific Cloud" (No. CZ.02.1.01/0.0/0.0/16_013/0001802). Computational resources were supplied by the project "e-Infrastruktura CZ" (e-INFRA LM2018140) provided within the program Projects of Large Research, Development and Innovations Infrastructures.
References

[1] J. Alcaraz, A. Sikora, and E. César. Hardware counters' space reduction for code region characterization. In Euro-Par 2019: Parallel Processing, pages 74–86. Springer International Publishing, 2019.
[2] J. Ansel, S. Kamil, K. Veeramachaneni, J. Ragan-Kelley, J. Bosboom, U.-M. O'Reilly, and S. Amarasinghe. OpenTuner: An extensible framework for program autotuning. In Proceedings of the 23rd International Conference on Parallel Architectures and Compilation, PACT '14, pages 303–316, 2014.
[3] S. S. Baghsorkhi, M. Delahaye, S. J. Patel, W. D. Gropp, and W.-m. W. Hwu. An adaptive performance modeling tool for GPU architectures. SIGPLAN Not., 45(5):105–114, 2010.
[4] A. Bakhoda, G. L. Yuan, W. W. L. Fung, H. Wong, and T. M. Aamodt. Analyzing CUDA workloads using a detailed GPU simulator. In 2009 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), 2009.
[5] P. Balaprakash, J. Dongarra, T. Gamblin, M. Hall, J. K. Hollingsworth, B. Norris, and R. Vuduc. Autotuning in high-performance computing applications. Proceedings of the IEEE, 106(11):2068–2083, 2018.
[6] P. Balaprakash, S. M. Wild, and P. D. Hovland. Can search algorithms save large-scale automatic performance tuning? Procedia Computer Science, 4:2136–2145, 2011.
[7] J. Cavazos, G. Fursin, F. Agakov, E. Bonilla, M. F. P. O'Boyle, and O. Temam. Rapidly selecting good compiler optimizations using performance counters. In International Symposium on Code Generation and Optimization (CGO'07), 2007.
[8] T. A. Connors and A. Qasem. Automatically selecting profitable thread block sizes for accelerated kernels. 2017.
[9] C. Cummins, P. Petoumenos, Z. Wang, and H. Leather. End-to-end deep learning of optimization heuristics. In 2017 26th International Conference on Parallel Architectures and Compilation Techniques (PACT), pages 219–232, 2017.
[10] T. L. Falch and A. C. Elster. Machine learning based auto-tuning for enhanced OpenCL performance portability. In Proceedings of the 2015 IEEE International Parallel and Distributed Processing Symposium Workshop, 2015.
[11] W. Feng and T. S. Abdelrahman. A sampling based strategy to automatic performance tuning of GPU programs. 2017.
[12] J. Filipovič, J. Hozzová, A. Nezarat, J. Oľha, and F. Petrovič. Searching CUDA code autotuning spaces with hardware performance counters: data from benchmarks running on various GPU architectures. Data in Brief, 2021, in press.
[13] J. Filipovič, F. Petrovič, and S. Benkner. Autotuning of OpenCL kernels with global optimizations. In Proceedings of the 1st Workshop on AutotuniNg and aDaptivity AppRoaches for Energy Efficient HPC Systems (ANDARE '17), 2017.
[14] M. Gerndt, E. Cesar, and S. Benkner. Automatic Tuning of HPC Applications – The Periscope Tuning Framework (PTF). Shaker Verlag, 2015.
[15] S. G. D. Gonzalo, S. D. Hammond, C. R. Trott, and W.-m. W. Hwu. Revisiting online autotuning for sparse-matrix vector multiplication kernels on next-generation architectures. 2017.
[16] S. Hong and H. Kim. An analytical model for a GPU architecture with memory-level and thread-level parallelism awareness. SIGARCH Comput. Archit. News, 37(3):152–163, 2009.
[17] J. Huang, J. H. Lee, H. Kim, and H. S. Lee. GPUMech: GPU performance modeling technique based on interval analysis. 2014.
[18] W. Jia, K. A. Shaw, and M. Martonosi. Starchart: Hardware and software optimization using recursive partitioning regression trees. In Proceedings of the 22nd International Conference on Parallel Architectures and Compilation Techniques, 2013.
[19] E. Konstantinidis and Y. Cotronis. A practical performance model for compute and memory bound GPU kernels. 2015.
[20] J. Kurzak, S. Tomov, and J. Dongarra. Autotuning GEMM kernels for the Fermi GPU. IEEE Transactions on Parallel and Distributed Systems, 23(11):2045–2057, 2012.
[21] A. Li, Y. C. Tay, A. Kumar, and H. Corporaal. Transit: A visual analytical model for multithreaded machines. In Proceedings of the 24th International Symposium on High-Performance Parallel and Distributed Computing, 2015.
[22] S. Madougou, A. Varbanescu, C. de Laat, and R. van Nieuwpoort. The landscape of GPGPU performance modeling tools. Parallel Computing, 56:18–33, 2016.
[23] S. Muralidharan, A. Roy, M. Hall, M. Garland, and P. Rai. Architecture-adaptive code variant tuning. SIGARCH Computer Architecture News, 44(2):325–338, 2016.
[24] C. Nugteren. CLBlast: A tuned OpenCL BLAS library. In Proceedings of the International Workshop on OpenCL, IWOCL '18, pages 5:1–5:10. ACM, 2018.
[25] C. Nugteren and V. Codreanu. CLTune: A generic auto-tuner for OpenCL kernels. In Proceedings of the IEEE 9th International Symposium on Embedded Multicore/Many-core Systems-on-Chip (MCSoC), 2015.
[26] J. Oľha, J. Hozzová, J. Fousek, and J. Filipovič. Exploiting historical data: Pruning autotuning spaces and estimating the number of tuning steps. Concurrency and Computation: Practice and Experience, 32(21), 2020.
[27] F. Petrovič, D. Střelák, J. Hozzová, J. Oľha, R. Trembecký, S. Benkner, and J. Filipovič. A benchmark set of highly-efficient CUDA and OpenCL kernels and its dynamic autotuning with kernel tuning toolkit. Future Generation Computer Systems, 108:161–177, 2020.
[28] J. Price and S. McIntosh-Smith. Improving auto-tuning convergence times with dynamically generated predictive performance models. 2015.
[29] J. R. Quinlan. Induction of decision trees. Machine Learning, 1:81–106, 1986.
[30] S. Rahman, M. Burtscher, Z. Zong, and A. Qasem. Maximizing hardware prefetch effectiveness with machine learning. 2015.
[31] A. Rasch and S. Gorlatch. ATF: A generic directive-based auto-tuning framework. Concurrency and Computation: Practice and Experience, 0(0):e4423, 2018.
[32] A. Roy, P. Balaprakash, P. D. Hovland, and S. M. Wild. Exploiting performance portability in search algorithms for autotuning. 2016.
[33] K. Seymour, H. You, and J. Dongarra. A comparison of search heuristics for empirical code optimization. 2008.
[34] J. Sim, A. Dasgupta, H. Kim, and R. Vuduc. A performance analysis framework for identifying potential benefits in GPGPU applications. SIGPLAN Not., 47(8):11–22, 2012.
[35] D. Střelák, C. O. S. Sorzano, J. M. Carazo, and J. Filipovič. A GPU acceleration of 3D Fourier reconstruction in cryo-EM. The International Journal of High Performance Computing Applications, 2019.
[36] P. Tillet and D. Cox. Input-aware auto-tuning of compute-bound HPC kernels. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, 2017.
[37] V. Volkov. Understanding Latency Hiding on GPUs. PhD thesis, University of California at Berkeley, 2016.
[38] Z. Wang and M. F. P. O'Boyle. Mapping parallelism to multi-cores: A machine learning based approach. SIGPLAN Not., 44(4):75–84, 2009.
[39] B. van Werkhoven. Kernel Tuner: A search-optimizing GPU code auto-tuner. Future Generation Computer Systems, 90:347–358, 2019.
[40] G. Wu, J. L. Greathouse, A. Lyashevsky, N. Jayasena, and D. Chiou. GPGPU performance and power estimation using machine learning. 2015.
[41] C. Yu and S. Tsao. Efficient and portable workgroup size tuning. IEEE Transactions on Parallel and Distributed Systems, 31(2):455–469, 2020.
[42] Y. Zhang and J. D. Owens. A quantitative performance analysis model for GPU architectures. 2011.