Data Article
Searching CUDA code autotuning spaces with hardware performance counters: data from benchmarks running on various GPU architectures
Jiří Filipovič a,∗, Jana Hozzová a, Amin Nezarat a, Jaroslav Oľha a, Filip Petrovič a

a Institute of Computer Science, Masaryk University, Botanická 68a, 60200 Brno, Czech Republic

∗ Corresponding author. E-mail addresses: [email protected] (Jiří Filipovič), [email protected] (Jana Hozzová), [email protected] (Amin Nezarat), (Jaroslav Oľha), [email protected] (Filip Petrovič)
Keywords: auto-tuning, tuning spaces, performance counters, CUDA
Abstract

We have developed several autotuning benchmarks in CUDA that take into account performance-relevant source-code parameters and reach near peak performance on various GPU architectures. We have used them during the development and evaluation of a novel tuning-space search method proposed in [1]. With our framework Kernel Tuning Toolkit, freely available at GitHub, we measured computation times and hardware performance counters on several GPUs for the complete tuning spaces of five benchmarks. These data, which we provide here, might benefit research of search algorithms for the tuning spaces of GPU codes, or research of the relation between applied code optimizations, hardware performance counters, and GPU kernels' performance.

Moreover, we describe in detail the scripts we used for robust evaluation of our searcher and for comparison to others. In particular, the script that simulates the tuning, i.e., replaces time-demanding compilation and execution of the tuned kernels with a quick reading of the computation time from our measured data, makes it possible to inspect the convergence of tuning search over a large number of experiments. These scripts, freely available with our other codes, make it easier to experiment with search algorithms and compare them in a robust way.

During our research, we generated models for predicting values of performance counters from values of tuning parameters of our benchmarks. Here, we provide the models themselves and describe the scripts we implemented for their training. These data might benefit researchers who want to reproduce or build on our research.
Specifications Table
Subject: Computer Science

Specific subject area: Auto-tuning GPU kernels using hardware performance counters

Type of data: Tables; Python and R scripts

How data were acquired:

Raw autotuning data: Using our autotuning framework KTT, we measured computation time and collected hardware performance counters for the whole tuning spaces of five benchmark CUDA codes on four GPUs. Kernel Tuning Toolkit (KTT) is freely available in the GitHub repository https://github.com/HiPerCoRe/KTT; the five benchmarks are also there, in the 'examples' folder. These benchmarks cover a wide range of computational problems: computing convolution, Coulomb summation in three dimensions, matrix multiplication, matrix transposition and the N-body problem. They also differ in the sizes of their tuning spaces.

Prediction models: Using our scripts, also available in the KTT repository at GitHub, we trained models with the raw tuning data.

Data format: Raw; Analyzed; Scripts

Parameters for data collection:

Raw autotuning data: Computation time and performance counters were measured for five benchmarks (GEMM, Convolution, Matrix transposition, 3D Coulomb summation and N-body) bundled with KTT, git tag v1.3-profile-searcher. We ran them on four GPUs: GeForce GTX 680, GeForce GTX 750, GeForce GTX 1070 and GeForce RTX 2080. KTT was configured to perform exhaustive exploration of tuning spaces on each GPU under our test, with profiling switched on. The size of the input for each benchmark was chosen so that the kernel execution took 1-10 milliseconds. For the GEMM benchmark, data for several input sizes were collected on GeForce GTX 1070.

Prediction models: Prediction models were trained for all hardware performance counters, and for the local and global size.

Description of data collection:

Raw autotuning data: KTT performed exhaustive exploration of the complete tuning spaces (sets of all executable tuning configurations) of the tested benchmarks for each GPU. Each tuning configuration contains information about the tuning parameters (affecting how the GPU kernel is created and executed), the runtime of the kernel, and hardware performance counters provided by the NVIDIA CUPTI library. Tuning configurations which cannot be executed on a particular GPU are not stored.

Prediction models: Models were created from the raw autotuning data with the scripts create_least_squares_models.R and generate_decision_tree_model.py, available with the profile-based searcher in KTT.

Data source location: Institute of Computer Science, Masaryk University, Brno, Czech Republic (49.211N, 16.598E)

Data accessibility: Repository name: Mendeley Data. Data identification number: doi:10.17632/nn53dskr7z.1. Direct URL to data: http://dx.doi.org/10.17632/nn53dskr7z.1

Related research article: Filipovič, J., Hozzová, J., Nezarat, A., Oľha, J., Petrovič, F., Using hardware performance counters to speed up autotuning convergence on GPUs, Future Generation Computer Systems. In Press.

Value of the Data

• Raw autotuning data contain, to the best of our knowledge, the first freely available complete tuning spaces of several CUDA kernels prepared for autotuning, alongside their computation times and hardware performance counter measurements on several GPUs. Scripts make it easier to experiment with searching tuning spaces in a controlled environment, so the results of searchers are comparable.

• These data will help those researching how to search the tuning spaces of GPU codes, or those interested in mining the data relating applied code optimizations, hardware performance counters, and GPU kernels' performance.

• With raw autotuning data, new search algorithms for navigating the tuning spaces can be easily evaluated for multiple GPUs (even those unavailable to the researchers), skipping the high time demands of actually compiling, running and measuring. Moreover, the global optimum of the tuning space is known from the data.

• With the scripts for simulated and real-time tuning, the results of others (with a new search method or a new prediction model for performance counters) can be consistently compared to the results of our searcher.

• Availability of the KTT autotuner and the scripts for model preparation allows users to expand our dataset by measurements on their own GPUs, or with their own benchmarks.
Data Description
Raw Autotuning Data
Raw autotuning data were produced by Kernel Tuning Toolkit 1.3 (git tag v1.3-profile-searcher, available at https://github.com/HiPerCoRe/KTT/releases/tag/v1.3-profile-searcher) running on the GPUs listed in Table 2. For each benchmark listed in Table 3 (available in the KTT repository in the folder examples as cltune-conv, coulomb_sum_3d, cltune-gemm, mtran and nbody), an exhaustive search of the whole tuning space was executed, measuring computation time and hardware performance counters. For details on the benchmarks and their tuning spaces, see [2].

The raw data are available at http://dx.doi.org/10.17632/nn53dskr7z.1 in the directory 'raw-autotuning-data'. They are stored as CSV files whose names contain the abbreviation of the GPU, the abbreviation of the benchmark, and the suffix _output.csv. For example, the data obtained on GeForce GTX 1070 with the N-body benchmark are stored in 1070-nbody_output.csv. There are special cases for the GEMM benchmark, where we obtained data on small and highly rectangular matrices; those benchmarks are abbreviated according to their input shapes (e.g., the multiplication of 128 × 128 matrices, or the multiplication of a matrix 4096 × 16 with a matrix 4096 × ...).
Table 2. GPU devices used to obtain our data.

Device            Architecture  Released  Abbreviation
GeForce GTX 680   Kepler        2012      680
GeForce GTX 750   Maxwell       2014      750
GeForce GTX 1070  Pascal        2016      1070
GeForce RTX 2080  Turing        2018      2080
Table 3. A list of the benchmarks used to obtain our data.

Benchmark              Description                                  Abbreviation
Convolution            2D convolution kernel using a 7 × 7 filter   conv
Coulomb summation      Coulomb summation in a 3D grid               coulomb
Matrix multiplication  dense matrix-matrix multiplication           gemm-reduced
Matrix transposition   matrix transposition                         mtran
N-body                 N-body simulation                            nbody

Each CSV file has the following structure:

• the first line is the header containing the names of the columns;

• each other line contains the profile of one tuning configuration (a combination of tuning parameters, which produces a unique CUDA kernel source code and execution setting);

• if some configuration cannot be executed on a given GPU (e.g., because of insufficient hardware resources), it is not included in the CSV (therefore, the same benchmark can produce CSV files with a different number of lines when executed on different GPUs).

Each line of the CSV file contains the following types of columns:

• Kernel name: the name of the benchmarked kernel (the same for one type of benchmark);

• Computation duration (µs): the duration of the benchmarked kernel and the unit the time is measured in;

• Global size and Local size: the global and local size of the executed kernel (number of threads and block size in CUDA terminology). The size is counted as a scalar number; it reflects the overall number of threads with no respect to the grid or block dimensionality;

• Tuning parameters: the benchmark-specific tuning parameters, named in capitals by our convention (e.g., VECTOR_TYPE or CR);

• Hardware performance counters: performance counters measured on the particular GPU (e.g., dram_utilization or inst_fp_32).
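To illustrate the format, the following minimal sketch (not part of KTT) loads one raw data file with pandas and extracts the global optimum of the tuning space; the file name follows the convention described above, and the exact header of the time column is taken from the CSV itself rather than hard-coded.

import pandas as pd

df = pd.read_csv("1070-nbody_output.csv")

# Locate the runtime column by its prefix (its full name embeds the unit).
time_col = next(c for c in df.columns if c.startswith("Computation duration"))

# Every row is one executable tuning configuration, so the global optimum
# of the tuning space is simply the row with the shortest runtime.
best = df.loc[df[time_col].idxmin()]
print(best)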
Table 4. Input sizes used for gathering raw tuning data. The shown number determines the size of the input matrix/matrices in both dimensions for the conv, gemm-reduced and mtran benchmarks. For the nbody benchmark, the shown number determines the number of simulated bodies. Finally, for the coulomb benchmark, the first number is the size of a 3D grid (the same in all dimensions), whereas the second number determines the number of atoms.
              680      750      1070     2080
conv          2048     4096     4096     4096
coulomb       256, 64  256, 64  256, 64  256, 64
gemm-reduced  1024     1024     1024     2048
mtran         8192     8192     8192     8192
nbody         16384    16384    16384    16384

Please note that not all available hardware performance counters were measured, due to the time demands of measuring the complete tuning space. The set of performance counters differs from GPU to GPU, because different architectures implement different performance counters. The biggest change comes with GeForce RTX 2080, where the performance counters are completely re-designed and re-named. The input size for the kernels was selected so that the kernel execution took approximately 1-10 milliseconds. These sizes obviously differ for each benchmark and GPU; Table 4 summarizes them.

Prediction Models for Performance Counters
We provide pre-computed prediction models for performance counters. All of them predict the global size, the local size, and the performance counters relevant for the GPU on which the training raw autotuning data were measured. For a detailed list of performance counters and implemented models, see [1].
Least-squares Nonlinear Models
Nonlinear prediction models were produced with the script create_least_squares_models.R, bundled with KTT in 'profile-searcher/scripts-prep/'. For each raw tuning data file (i.e., for each benchmark and each GPU, and for all input sizes of the gemm benchmark on GeForce GTX 1070), we ran the script, producing multiple models for each performance counter, each for a different combination of values of the binary tuning parameters. Please see the section Experimental Design, Materials and Methods for details on how the models are generated; it might make it easier to understand the format of the model files. The least-squares nonlinear models are stored as CSV files, following a similar naming convention as the raw autotuning data: the abbreviation of the GPU, the abbreviation of the benchmark and the suffix -model_[number].csv. The special cases for the GEMM benchmark are named analogously to their raw tuning data files.

The CSV files produced by the script contain three sections. The first section includes a line for each tuning parameter, describing an expression for coding this parameter (as coded values of tuning parameters are used to predict the values of performance counters). The second section includes one line called Condition, describing a logical condition on the values of the binary parameters this model was trained for. Furthermore, the third section includes a line for each performance counter, describing an expression for predicting the given performance counter's value from the coded values of the tuning parameters.
Decision Tree
Decision-tree prediction models were produced with the script generate_decision_tree_model.py, bundled with KTT in 'profile-searcher/scripts-prep/'. The script takes raw tuning data as an input and creates a predictive model of performance counters. Please see the section Experimental Design, Materials and Methods for details on how the models are generated. The resulting decision tree is stored as a pickle file, together with a CSV file containing a list of all performance counters predicted by the model.
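The idea behind the script can be sketched as follows; this is a minimal illustration, not the bundled implementation, and the column indices are borrowed from the gemm-reduced example used in Listing 2 below.

import pickle
import pandas as pd
from sklearn.tree import DecisionTreeRegressor

df = pd.read_csv("1070-gemm-reduced_output.csv")
X = df.iloc[:, 4:19]                          # tuning parameters
y = df.iloc[:, [2, 3] + list(range(19, 62))]  # global/local size + counters

# scikit-learn trees support multi-output regression, so a single tree
# predicts the global size, the local size and all performance counters.
tree = DecisionTreeRegressor().fit(X, y)

# Persist the model and the list of predicted columns, mirroring the
# *_DT.sav / *_DT.sav.pc file pair described above.
with open("1070-gemm-reduced_output_DT.sav", "wb") as f:
    pickle.dump(tree, f)
pd.Series(y.columns).to_csv("1070-gemm-reduced_output_DT.sav.pc", index=False)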
Experimental Design, Materials and Methods
Obtaining Raw Tuning Data via Kernel Tuning Toolkit
The raw tuning data are obtained during an autotuning process performed by Kernel Tuning Toolkit on a GPU of interest. Note that we recommend using the version tagged v1.3-profile-searcher, which couples KTT 1.3 with the profile-based searcher and the benchmarks listed in Table 3, prepared for collecting raw data or executing tuning with the profile-based searcher. We can explore the tuning space of any benchmark bundled with KTT (in the 'examples' folder) or of user-provided code. To obtain tuning data with hardware performance counters, KTT has to be built with profiling enabled (e.g., by calling premake5 --profiling=cupti gmake; see the KTT documentation for further details). Moreover, profiling has to be switched on in the autotuned application by calling the ktt::Tuner::setKernelProfiling() method (the profiling in benchmarks bundled with KTT can be switched on by setting the USE_PROFILING macro to 1).

KTT can explore either the entire tuning space (the default behaviour) or only its subset. In the latter case, it is recommended to use the random searcher to randomize the observed subset (using the method ktt::Tuner::setSearcher()). After the search of the space is complete, the tuning data are stored in CSV files by the method ktt::Tuner::PrintResult(). All benchmarks bundled with KTT store the resulting CSV and can execute exhaustive exploration of the tuning space by setting the preprocessor macro EXHAUSTIVE_SEARCH to 1. For more information about KTT methods and the implementation of new autotuned codes, we refer to its documentation at https://github.com/HiPerCoRe/KTT.

Generating Prediction Models from Raw Tuning Data
We provide scripts to generate prediction models from raw tuning data. These scripts take the tuning space of the problem with the collected performance counters and train a model that predicts the performance counters' values when given a tuning configuration.
Generating Least-squares Regression Nonlinear Models
We provide two scripts in the 'profile-searcher/scripts-prep' folder in the GitHub KTT repository. The main script create_least_squares_models.R trains nonlinear models: it takes the tuning data, and for each performance counter, it generates a model that predicts its value based on the values of the tuning parameters. To increase the accuracy of such prediction, we divide the tuning space into subspaces based on the values of the binary tuning parameters, as we suspect these have a profound influence on the performance counters. Thus, we generate several models for each performance counter, each model applicable only for a given combination of values of the binary tuning parameters. An example of its usage is shown in Listing 1. We recommend R v3.4.4; no special CRAN libraries are necessary.

Listing 1. Generating nonlinear models for the GEMM benchmark on GeForce GTX 1070
Rscript ./create_least_squares_models.R 1070-gemm-reduced_output.csv \
    1070-gemm-reduced 4:19 2,3,19:62

It takes four arguments:

• [input file name], e.g. 1070-gemm-reduced_output.csv; it must follow the formatting of raw tuning data, as described above in the section Data Description;

• [prefix for output files names], e.g. 1070-gemm-reduced; this will be used to name the output files with models, with -model_[number].csv appended;

• [numbers of columns with tuning parameters in input file], in a format allowing to set individual columns and intervals of columns (in the format 'from:to'), e.g. 2,5:12, meaning columns 2 and 5 through 11; counting starts at 0;

• [numbers of columns with performance counters in input file], in the same format.

After parsing the script arguments and reading the input file, we code the tuning parameters' values, i.e., scale them to the range of <-1, 1>. This step is recommended in any regression model design, as models generally do not work well with the absolute values of the factors. Next, we select the values of tuning parameters that will determine the training data. In other words, we do not choose the data points (rows from the input file) for training randomly. We select a few values of the non-binary tuning parameters and then include all available combinations in the training dataset. We need to moderate the number of values to prevent an exponential increase in training data size, or a poor sampling of some part of the tuning space due to constraints.

The model-fitting function takes two arguments: the formula and the training data. The formula includes the factors, i.e., the coded tuning parameters, and arithmetic operations with them. To make the models nonlinear, we include multiplications of factors (to capture their interactions) and quadratic terms. The training data include the rows from the input file with the selected values of tuning parameters and the corresponding values of the given profiling counter.

The output of the script is multiple files named [output name]-model_[number].csv. The number of models corresponds to the number of combinations of values of the binary parameters. If a model cannot be created for a specific combination of binary parameters (e.g., there are no data due to constraints), the closest model (i.e., the one differing in the minimal number of values of binary tuning parameters) fills in and is printed in the output file. The format and contents of the model files are described in the above section Data Description.

The script generate_least_squares_models.py makes it quick and easy to generate models for all our raw tuning data. It requires python3, with the docopt library installed. It takes one argument, --benchmark. The option --benchmark GPU generates models for the benchmarks conv, coulomb, gemm-reduced, mtran and nbody for all GPUs. The option --benchmark GEMM generates models for the different input sizes of the benchmark gemm-reduced on GeForce GTX 1070. Users may need to modify the script to accommodate the names of the folders with data.
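To make the modelling idea concrete, the following minimal Python sketch fits one such model; the authoritative implementation is the R script described above, and the tuning-parameter and counter names used here ('TP_A', 'TP_B', 'dram_utilization') are hypothetical.

import numpy as np
import pandas as pd

df = pd.read_csv("1070-gemm-reduced_output.csv")

def code(col):
    # Scale a tuning parameter linearly to the range <-1, 1>.
    lo, hi = col.min(), col.max()
    return (2.0 * (col - lo) / (hi - lo) - 1.0).to_numpy()

a = code(df["TP_A"])
b = code(df["TP_B"])

# Design matrix with linear terms, an interaction term (capturing the
# interplay of the two parameters) and quadratic terms, as described above.
X = np.column_stack([np.ones(len(df)), a, b, a * b, a**2, b**2])
y = df["dram_utilization"].to_numpy()

coef, *_ = np.linalg.lstsq(X, y, rcond=None)  # least-squares fit
predicted = X @ coef                          # predicted counter values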
Generating Decision Trees

The script generate_decision_tree_model.py for decision-tree model preparation is stored in the 'profile-searcher/scripts-prep' folder in a KTT distribution. While generating the model, users have to supply the CSV file containing the explored tuning space, the columns containing the tuning parameters, and the columns containing the profiling counters together with the parallelism configuration (the "Global size" and "Local size" columns), in the same format as with the least-squares regression nonlinear models; see Listing 2.
Listing 2. How to call the script for generating a decision-tree model

python3 generate_decision_tree_model.py \
    -i 1070-gemm-reduced_output.csv -t 4:19 -c 2,3,19:62
The script builds models predicting performance counters using optimized decision trees. The performance counter prediction with decision trees is computationally more efficient than with the least-squares models; therefore, we recommend them as the default choice. The decision trees have high precision in densely sampled tuning spaces, but they are poor at extrapolation. Therefore, if only a smaller part of the tuning space is sampled, we recommend testing whether the least-squares method would bring better precision and faster tuning convergence.

The resulting decision tree is stored in a file with the suffix _DT.sav, and the list of hardware performance counters predicted by the script has the suffix _DT.sav.pc. For example, the model for the file 1070-gemm-reduced_output.csv is stored in 1070-gemm-reduced_output_DT.sav.

Execution of Simulated Tuning
The simulated tuning script simulated-profiling-searcher.py performs a search of the autotuning space on a pre-computed tuning space. It requires the auxiliary files base.py and mlKTTPredictor.py, distributed with KTT. It also requires python3, with the libraries docopt, numpy, pandas, pickle and sklearn installed.

Instead of real execution and profiling of autotuned kernels (obtaining their runtime and hardware performance counters), it reads the stored raw autotuning data (i.e., it just simulates their execution and profiling) and performs a search on them. The advantage of this approach is that it runs much faster than real tuning (as no compilation, execution or profiling is performed); therefore, the simulated tuning experiments can be repeated many times to get statistically relevant data. The simulated run also does not require the installation of KTT, or the GPU we are simulating autotuning for. The convergence of the search method is measured and can be compared in terms of the number of search steps (equal to the number of kernel executions performed by KTT in a real environment).

Listing 3 shows three examples of running the script. We want to tune the gemm-reduced benchmark. The models for predicting performance counters were trained on GeForce GTX 750, and we want to use them to guide the search on GeForce GTX 1070. The three commands differ in the model used. The first one does not predict anything; it only reads the performance counters' values from the provided raw tuning data file. The second one uses a decision tree, and the third one uses least-squares nonlinear models.
Listing 3. Examples of using simulated-profiling-searcher.py

python3 -W ignore ./simulated-profiling-searcher.py \
    -o 1070-gemm-reduced_output.csv --oc 6.1 --mp 15 --co 1920 \
    --cm 750-gemm-reduced_output.csv --ic 5.0 -p 1 -t 4:19 -c 2,3,19:62 \
    --compute_bound -e 1000 -i 1000

python3 -W ignore ./simulated-profiling-searcher.py \
    -o 1070-gemm-reduced_output.csv --oc 6.1 --mp 15 --co 1920 \
    --dt 750-gemm-reduced_output_DT.sav --ic 5.0 -p 1 \
    -t 4:19 -c 2,3,19:62 --compute_bound -e 1000 -i 1000

python3 -W ignore ./simulated-profiling-searcher.py \
    -o 1070-gemm-reduced_output.csv --oc 6.1 --mp 15 --co 1920 \
    --ls 750-gemm-reduced --ic 5.0 -p 1 -t 4:19 -c 2,3,19:62 \
    --compute_bound -e 1000 -i 1000
The script takes multiple arguments:

• -o [raw tuning data file]: the raw tuning data file, following the format described in the section Data Description;

• --oc [compute capability of the GPU used to produce the raw tuning data]: e.g. 6.1, if the raw tuning data came from GeForce GTX 1070;

• --mp [number of multiprocessors on that GPU]: e.g. 15, if the raw tuning data came from GeForce GTX 1070;

• --co [number of CUDA cores on that GPU]: e.g. 1920, if the raw tuning data came from GeForce GTX 1070;

• one of the following:

  – --cm [raw tuning data file]: with this option, no prediction of the values of performance counters is computed; their actual values are read from the file with the given raw tuning data;

  – --dt [decision tree model file]: with this option, a decision tree model is employed to predict the values of performance counters;

  – --ls [prefix for least squares model files]: with this option, least-squares nonlinear models are employed to predict the values of performance counters;

• --ic [compute capability of the GPU of the training data for the model]: e.g. 5.0, if the model was trained with data from GeForce GTX 750;

• -p [column with computation time]: always 1 in the provided raw tuning data;

• -t [columns with tuning parameters];

• -c [columns with performance counters];

• -e [number of experiments]: sets how many times the experiment is repeated, to get more stable results in the case of randomized searchers;

• -i [number of iterations]: sets how many tuning iterations (i.e., search steps) are performed per experiment;

• --compute_bound or --memory_bound: e.g. --compute_bound, as that is the character of the gemm-reduced problem.

Note that the values of several arguments, such as the compute capabilities of the GPUs, the numbers of multiprocessors or CUDA cores, and the column indexes for computation time, tuning parameters and profiling counters, are available for our raw tuning data in the script autobench.py, described later in this section. For details on the algorithm of the profile-based search, please see [1].

The results of the analysis are stored as CSV files of the following format. The first line contains a header with the names of the columns. Each following row presents an iteration of the searcher (i.e., the exploration of the next tuning configuration, requiring its profiling). The first column contains the iteration number. The second column contains the average runtime, with the standard deviation, of the best kernel known in this iteration when the random searcher is utilized. The third column contains the average runtime, with the standard deviation, of the best kernel known in this iteration when the profile-based searcher is utilized.

The script autobench.py makes it easy to run simulated tuning for all our raw tuning data. It takes two arguments, --benchmark and --method. The option --benchmark GPU runs simulated tuning for the benchmarks conv, coulomb, gemm-reduced, mtran and nbody for all GPUs. The option --benchmark GEMM runs simulated tuning for the different input sizes of the benchmark gemm-reduced on GeForce GTX 1070. The option --method has three possible arguments, Exact, DecisionTree or LeastSquares, to denote the model used for predicting the values of performance counters. Users may need to modify the script to accommodate the names of the folders with data. Moreover, the script can be used as a source of information on the possible values of several command-line arguments of simulated-profiling-searcher.py, such as the compute capabilities of different GPUs, the numbers of their multiprocessors and CUDA cores, and the indexes of the columns for computation time, tuning parameters and performance counters in the raw tuning data we provide.

We used the simulated tuning to analyze the convergence speed of the profile-based searcher proposed in [1] and of the random search. During the analysis, the autotuning is performed in the defined number of iterations. In each iteration of the searching process, the runtime of the best kernel found is logged. The autotuning is performed multiple times, so we obtain the average speed of the best kernel for each iteration over multiple autotuning executions. Other search methods, or modifications based on our profile-based searcher, might be easily added to the scripts and compared consistently.
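The following minimal sketch illustrates the simulation principle for the random searcher; it is not the provided script (which additionally implements the profile-based searcher), and it assumes a raw tuning data file at hand.

import numpy as np
import pandas as pd

df = pd.read_csv("1070-gemm-reduced_output.csv")
time_col = next(c for c in df.columns if c.startswith("Computation duration"))
runtimes = df[time_col].to_numpy()

def random_search(runtimes, iterations=100, experiments=1000, seed=0):
    rng = np.random.default_rng(seed)
    best = np.empty((experiments, iterations))
    for e in range(experiments):
        # Sample configurations without replacement; 'iterations' must not
        # exceed the number of configurations in the tuning space.
        picks = rng.choice(len(runtimes), size=iterations, replace=False)
        best[e] = np.minimum.accumulate(runtimes[picks])
    # Average best-so-far runtime per iteration over all experiments.
    return best.mean(axis=0)

convergence = random_search(runtimes)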
Execution of Real-time Tuning

Real-time tuning performs a search of the autotuning space without a pre-computed tuning space. The autotuned kernels are actually compiled, executed and profiled during the search. This is far more demanding than the simulated tuning described above, but it makes it possible to measure the actual time per tuning search iteration. The convergence of the search method is measured and can be compared in terms of tuning time.

Real-time tuning can be executed by running a compiled benchmark. For benchmarks bundled with KTT, the preparation includes the following steps:

• KTT needs to be compiled with profiling, i.e., premake5 --profiling=cupti gmake needs to be run before building it. In the case of older GPU architectures, use --profiling=cupti-legacy instead.

• In the code of the benchmark (in the cpp file), the preprocessor macro EXHAUSTIVE_SEARCH has to be set to 0.

• The random search is used by default. To test the profile-based searcher [1], the macro USE_PROFILE_SEARCHER has to be set to 1 in the code of the benchmark (in the cpp file).

• The time for autotuning is restricted to a certain value set by the macro TUNE_SEC. This time can be altered; for example, TUNE_SEC 60 performs autotuning for 60 seconds.

• The prediction models needed for the profile-based searcher have to be in the 'KTT/profile-searcher/models' folder.

Listing 4 shows examples of running a single real-time tuning for each benchmark and saving the log file.
Listing 4. Running a single execution of real-time tuning on every benchmark

cd KTT/build/x86_64_Release/
./conv_cuda > conv_experiment_1.log
./coulomb_sum_3d_cuda > coulomb_experiment_1.log
./gemm_cuda > gemm_experiment_1.log
./mtran_cuda > mtran_experiment_1.log
./nbody_cuda > nbody_experiment_1.log
The benchmark executable has two arguments; however, both of them have default values:

• [platform index]: default value 0; cannot be changed when using CUDA;

• [device index]: default value 0; users may need to change this if multiple GPUs are available.

The input size can be modified in the source code of the given benchmark, in the main function. Again, reasonable default values are set. In our evaluation in [1], we set the input sizes of the benchmarks according to Table 4.

Proper evaluation of the searcher's convergence in terms of tuning time requires multiple runs of real-time tuning. This can be easily done by executing the benchmark multiple times and generating a log file for each run. For easier processing of the results in multiple log files, we provide the script histogram.py. See an example of its usage in Listing 5.

Listing 5. Processing multiple runs of real-time tuning

python3 ./histogram.py -s gemm-experiments -t 300
It takes two arguments:

• -s [folder with log files from real-time tuning runs];

• -t [time]: in seconds; denotes the maximum running time that is analyzed.

The results of the histogram.py script are stored as CSV files of the following format. The first line is a header containing the names of the columns. Each following row contains the measurement for one second of the autotuning process. A missing row indicates that no new data were available in that particular second of the tuning process (this may happen in the initial part of the tuning if the first profiled kernel runs for a long time). The rows contain the following columns:

• the time (in seconds) from the beginning of the autotuning;

• the average runtime of the best kernel known at the corresponding time;

• the standard deviation of the runtime;

• the minimal runtime;

• the maximal runtime.

The data are collected from all available executions of the autotuning (all log files in the folder passed by the argument -s).
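For instance, a few lines of Python suffice to read the produced CSV and report the convergence at the end of the analyzed window; the output file name used here is hypothetical, and the column layout is the one just described.

import pandas as pd

hist = pd.read_csv("gemm-experiments-histogram.csv")
last = hist.iloc[-1]  # columns: time, average, std. deviation, min, max
print(f"after {last.iloc[0]} s: best kernel runtime {last.iloc[1]} "
      f"(std {last.iloc[2]}, min {last.iloc[3]}, max {last.iloc[4]})")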
Acknowledgments
The work was supported by the European Regional Development Fund project "CERIT Scientific Cloud" (No. CZ.02.1.01/0.0/0.0/16_013/0001802).

Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships which have, or could be perceived to have, influenced the work reported in this article.
References

[1] J. Filipovič, J. Hozzová, A. Nezarat, J. Oľha, F. Petrovič, Using hardware performance counters to speed up autotuning convergence on GPUs, Future Generation Computer Systems, In Press (2021).
[2] F. Petrovič, D. Střelák, J. Hozzová, J. Oľha, R. Trembecký, S. Benkner, J. Filipovič, A benchmark set of highly-efficient CUDA and OpenCL kernels and its dynamic autotuning with Kernel Tuning Toolkit, Future Generation Computer Systems 108 (2020) 161-177.
[3] C. Nugteren, V. Codreanu, CLTune: A generic auto-tuner for OpenCL kernels, in: Proceedings of the IEEE 9th International Symposium on Embedded Multicore/Many-core Systems-on-Chip (MCSoC), 2015.
[4] J. Filipovič, F. Petrovič, S. Benkner, Autotuning of OpenCL kernels with global optimizations, in: Proceedings of the 1st Workshop on AutotuniNg and aDaptivity AppRoaches for Energy Efficient HPC Systems (ANDARE '17), 2017.
[5] C. Nugteren, CLBlast: A tuned OpenCL BLAS library, in: Proceedings of the International Workshop on OpenCL, IWOCL '18, ACM, 2018, pp. 5:1-5:10. doi:10.1145/3204919.3204924