Data Article
Searching CUDA code autotuning spaces with hardware performance counters: data from benchmarks running on various GPU architectures
Jiří Filipovič a,∗, Jana Hozzová a, Amin Nezarat a, Jaroslav Oľha a, Filip Petrovič a

a Institute of Computer Science, Masaryk University, Botanická 68a, 60200 Brno, Czech Republic

∗ Corresponding author. E-mail addresses: [email protected] (Jiří Filipovič), [email protected] (Jana Hozzová), [email protected] (Amin Nezarat), (Jaroslav Oľha), [email protected] (Filip Petrovič)
Keywords: auto-tuning, tuning spaces, performance counters, CUDA
Abstract

We have developed several autotuning benchmarks in CUDA that take into account performance-relevant source-code parameters and reach near peak performance on various GPU architectures. We have used them during the development and evaluation of a novel tuning-space search method proposed in [1]. With our framework Kernel Tuning Toolkit, freely available at GitHub, we measured computation times and hardware performance counters on several GPUs for the complete tuning spaces of five benchmarks. These data, which we provide here, might benefit research of search algorithms for the tuning spaces of GPU codes, or research of the relation between applied code optimizations, hardware performance counters, and GPU kernels' performance.

Moreover, we describe in detail the scripts we used for robust evaluation of our searcher and for comparison to others. In particular, the script that simulates the tuning, i.e., replaces time-demanding compilation and execution of the tuned kernels with a quick reading of the computation time from our measured data, makes it possible to inspect the convergence of tuning search over a large number of experiments. These scripts, freely available with our other codes, make it easier to experiment with search algorithms and compare them in a robust way.

During our research, we generated models for predicting values of performance counters from values of tuning parameters of our benchmarks. Here, we provide the models themselves and describe the scripts we implemented for their training. These data might benefit researchers who want to reproduce or build on our research.
Specifications Table
Subject: Computer Science

Specific subject area: Auto-tuning GPU kernels using hardware performance counters

Type of data: Tables; Python and R scripts

How data were acquired:

Raw autotuning data: Using our autotuning framework KTT, we measured computation time and collected hardware performance counters for the whole tuning spaces of five benchmark CUDA codes on four GPUs. Kernel Tuning Toolkit (KTT) is freely available in the GitHub repository https://github.com/HiPerCoRe/KTT; the five benchmarks are also there, in the 'examples' folder. These benchmarks cover a wide range of computational problems: computing convolution, Coulomb summation in three dimensions, matrix multiplication, matrix transposition and the N-body problem. They also differ in the sizes of their tuning spaces.

Prediction models: Using our scripts, also available in the KTT repository at GitHub, we trained models with the raw tuning data.

Data format: Raw; Analyzed; Scripts

Parameters for data collection:

Raw autotuning data: Computation time and performance counters were measured for five benchmarks (GEMM, Convolution, Matrix transposition, 3D Coulomb summation and N-body) bundled with KTT, git tag v1.3-profile-searcher. We ran them on four GPUs: GeForce GTX 680, GeForce GTX 750, GeForce GTX 1070 and GeForce RTX 2080. KTT was configured to perform exhaustive exploration of tuning spaces on each GPU under our test, with profiling switched on. The size of the input for each benchmark was chosen so that the kernel execution took 1-10 milliseconds. For the GEMM benchmark, data for several input sizes were collected on GeForce GTX 1070.

Prediction models: Prediction models were trained for all hardware performance counters, and for the local and global size.

Description of data collection:

Raw autotuning data: KTT performed exhaustive exploration of the complete tuning spaces (sets of all executable tuning configurations) of the tested benchmarks for each GPU. Each tuning configuration contains information about the tuning parameters (affecting how the GPU kernel is created and executed), the runtime of the kernel, and hardware performance counters provided by the NVIDIA CUPTI library. Tuning configurations which cannot be executed on a particular GPU are not stored.

Prediction models: Models were created from the raw autotuning data with the scripts create_least_squares_models.R and generate_decision_tree_model.py, available with the profile-based searcher in KTT.

Data source location: Institute of Computer Science, Masaryk University, Brno, Czech Republic (49.211N, 16.598E)

Data accessibility: Repository name: Mendeley Data. Data identification number: doi:10.17632/nn53dskr7z.1. Direct URL to data: http://dx.doi.org/10.17632/nn53dskr7z.1

Related research article: Filipovič, J., Hozzová, J., Nezarat, A., Oľha, J., Petrovič, F., Using hardware performance counters to speed up autotuning convergence on GPUs, Future Generation Computer Systems. In Press.

Value of the Data

• Raw autotuning data contain, to the best of our knowledge, the first freely available complete tuning spaces of several CUDA kernels prepared for autotuning, alongside their computation times and hardware performance counter measurements on several GPUs. Scripts make it easier to experiment with searching tuning spaces in a controlled environment, so the results of searchers are comparable.

• These data will help those researching how to search the tuning spaces of GPU codes, or those interested in mining the data relating applied code optimizations, hardware performance counters, and GPU kernels' performance.

• With raw autotuning data, new search algorithms for navigating the tuning spaces can be easily evaluated for multiple GPUs (even those unavailable to the researchers), skipping the high time demands of actually compiling, running and measuring. Moreover, the global optimum of the tuning space is known from the data.

• With the scripts for simulated and real-time tuning, the results of others (with a new search method or a new prediction model for performance counters) can be consistently compared to the results of our searcher.

• Availability of the KTT autotuner and the scripts for model preparation allows users to expand our dataset by measurements on their own GPUs, or with their own benchmarks.
Data Description
Raw Autotuning Data
Raw autotuning data were produced by Kernel Tuning Toolkit 1.3 (git tag v1.3-profile-searcher, available at https://github.com/HiPerCoRe/KTT/releases/tag/v1.3-profile-searcher) running on the GPUs listed in Table 2. For each benchmark listed in Table 3 (available in the KTT repository in the folder examples as cltune-conv, coulomb_sum_3d, cltune-gemm, mtran and nbody), an exhaustive search of the whole tuning space was executed, measuring computation time and hardware performance counters. For details on the benchmarks and their tuning spaces, see [2].

The raw data are available at http://dx.doi.org/10.17632/nn53dskr7z.1 in the directory 'raw-autotuning-data'. They are stored as CSV files whose names contain the abbreviation of the GPU, the abbreviation of the benchmark, and the suffix _output.csv. For example, the data obtained on GeForce GTX 1070 with the N-body benchmark are stored in 1070-nbody_output.csv. There are special cases for the GEMM benchmark, where we obtained data on small and highly rectangular matrices; those benchmarks are abbreviated according to their input shapes (e.g., the multiplication of 128 × 128 matrices, or the multiplication of a matrix 4096 × 16 with a matrix 4096 × ...).
Table 2. GPU devices used to obtain our data.

Device            Architecture  Released  Abbreviation
GeForce GTX 680   Kepler        2012      680
GeForce GTX 750   Maxwell       2014      750
GeForce GTX 1070  Pascal        2016      1070
GeForce RTX 2080  Turing        2018      2080
Table 3. A list of the benchmarks used to obtain our data.

Benchmark              Description                                  Abbreviation
Convolution            2D convolution kernel using a 7 × 7 filter   conv
Coulomb summation      Coulomb summation in a 3D grid               coulomb
Matrix multiplication  dense matrix-matrix multiplication           gemm-reduced
Matrix transposition   matrix transposition                         mtran
N-body                 N-body simulation                            nbody

Each CSV file has the following structure:

• the first line is the header containing the names of the columns;

• each other line contains the profile of one tuning configuration (a combination of tuning parameters, which produces a unique CUDA kernel source code and execution setting);

• if some configuration cannot be executed on a given GPU (e.g., because of insufficient hardware resources), it is not included in the CSV (therefore, the same benchmark can produce CSV files with a different number of lines when executed on different GPUs).

Each line of the CSV file contains the following types of columns:

• Kernel name: the name of the benchmarked kernel (the same for one type of benchmark);

• Computation duration (µs): the duration of the benchmarked kernel and the unit the time is measured in;

• Global size and Local size: the global and local size of the executed kernel (number of threads and block size in CUDA terminology). The size is counted as a scalar number; it reflects the overall number of threads with no respect to the grid or block dimensionality;

• Tuning parameters: the benchmark-specific tuning parameters, named in capitals by our convention (e.g., VECTOR_TYPE or CR);

• Hardware performance counters: performance counters measured on the particular GPU (e.g., dram_utilization or inst_fp_32).
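To illustrate the format, the following minimal sketch (not part of KTT) loads one raw data file with pandas and extracts the global optimum of the tuning space; the file name follows the convention described above, and the exact header of the time column is taken from the CSV itself rather than hard-coded.

import pandas as pd

df = pd.read_csv("1070-nbody_output.csv")

# Locate the runtime column by its prefix (its full name embeds the unit).
time_col = next(c for c in df.columns if c.startswith("Computation duration"))

# Every row is one executable tuning configuration, so the global optimum
# of the tuning space is simply the row with the shortest runtime.
best = df.loc[df[time_col].idxmin()]
print(best)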
Table 4. Input sizes used for gathering raw tuning data. The shown number determines the size of the input matrix/matrices in both dimensions for the conv, gemm-reduced and mtran benchmarks. For the nbody benchmark, the shown number determines the number of simulated bodies. Finally, for the coulomb benchmark, the first number is the size of a 3D grid (the same in all dimensions), whereas the second number determines the number of atoms.
              680      750      1070     2080
conv          2048     4096     4096     4096
coulomb       256, 64  256, 64  256, 64  256, 64
gemm-reduced  1024     1024     1024     2048
mtran         8192     8192     8192     8192
nbody         16384    16384    16384    16384

Please note that not all available hardware performance counters were measured, due to the time demands of measuring the complete tuning space. The set of performance counters differs from GPU to GPU, because different architectures implement different performance counters. The biggest change comes with GeForce RTX 2080, where the performance counters are completely re-designed and re-named. The input size for the kernels was selected so that the kernel execution took approximately 1-10 milliseconds. These sizes obviously differ for each benchmark and GPU; Table 4 summarizes them.

Prediction Models for Performance Counters
We provide pre-computed prediction models for performance counters. All of them predict the global size, the local size, and the performance counters relevant for the GPU on which the training raw autotuning data were measured. For a detailed list of performance counters and implemented models, see [1].
Least-squares Nonlinear Models
Nonlinear prediction models were produced with the script create_least_squares_models.R, bundled with KTT in 'profile-searcher/scripts-prep/'. For each raw tuning data file (i.e., for each benchmark and each GPU, and for all input sizes of the gemm benchmark on GeForce GTX 1070), we ran the script, producing multiple models for each performance counter, each for a different combination of values of the binary tuning parameters. Please see the section Experimental Design, Materials and Methods for details on how the models are generated; it might make it easier to understand the format of the model files. The least-squares nonlinear models are stored as CSV files, following a similar naming convention as the raw autotuning data: the abbreviation of the GPU, the abbreviation of the benchmark and the suffix -model_[number].csv. The special cases for the GEMM benchmark are named analogously to their raw tuning data files.

The CSV files produced by the script contain three sections. The first section includes a line for each tuning parameter, describing an expression for coding this parameter (as coded values of tuning parameters are used to predict the values of performance counters). The second section includes one line called Condition, describing a logical condition on the values of the binary parameters this model was trained for. Furthermore, the third section includes a line for each performance counter, describing an expression for predicting the given performance counter's value from the coded values of the tuning parameters.
Decision Tree
Decision-tree prediction models were produced with the script generate_decision_tree_model.py, bundled with KTT in 'profile-searcher/scripts-prep/'. The script takes raw tuning data as an input and creates a predictive model of performance counters. Please see the section Experimental Design, Materials and Methods for details on how the models are generated. The resulting decision tree is stored as a pickle file, together with a CSV file containing a list of all performance counters predicted by the model.
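The idea behind the script can be sketched as follows; this is a minimal illustration, not the bundled implementation, and the column indices are borrowed from the gemm-reduced example used in Listing 2 below.

import pickle
import pandas as pd
from sklearn.tree import DecisionTreeRegressor

df = pd.read_csv("1070-gemm-reduced_output.csv")
X = df.iloc[:, 4:19]                          # tuning parameters
y = df.iloc[:, [2, 3] + list(range(19, 62))]  # global/local size + counters

# scikit-learn trees support multi-output regression, so a single tree
# predicts the global size, the local size and all performance counters.
tree = DecisionTreeRegressor().fit(X, y)

# Persist the model and the list of predicted columns, mirroring the
# *_DT.sav / *_DT.sav.pc file pair described above.
with open("1070-gemm-reduced_output_DT.sav", "wb") as f:
    pickle.dump(tree, f)
pd.Series(y.columns).to_csv("1070-gemm-reduced_output_DT.sav.pc", index=False)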
Experimental Design, Materials and Methods
Obtaining Raw Tuning Data via Kernel Tuning Toolkit
The raw tuning data are obtained during an autotuning process performed by Kernel Tuning Toolkit on a GPU of interest. Note that we recommend using the version tagged v1.3-profile-searcher, which couples KTT 1.3 with the profile-based searcher and the benchmarks listed in Table 3, prepared for collecting raw data or executing tuning with the profile-based searcher. We can explore the tuning space of any benchmark bundled with KTT (in the 'examples' folder) or of user-provided code. To obtain tuning data with hardware performance counters, KTT has to be built with profiling enabled (e.g., by calling premake5 --profiling=cupti gmake; see the KTT documentation for further details). Moreover, profiling has to be switched on in the autotuned application by calling the ktt::Tuner::setKernelProfiling() method (the profiling in benchmarks bundled with KTT can be switched on by setting the USE_PROFILING macro to 1).

KTT can explore either the entire tuning space (the default behaviour) or only its subset. In the latter case, it is recommended to use the random searcher to randomize the observed subset (using the method ktt::Tuner::setSearcher()). After the search of the space is complete, the tuning data are stored in CSV files by the method ktt::Tuner::PrintResult(). All benchmarks bundled with KTT store the resulting CSV and can execute exhaustive exploration of the tuning space by setting the preprocessor macro EXHAUSTIVE_SEARCH to 1. For more information about KTT methods and the implementation of new autotuned codes, we refer to its documentation at https://github.com/HiPerCoRe/KTT.

Generating Prediction Models from Raw Tuning Data
We provide scripts to generate prediction models from raw tuning data. These scripts take the tuning space of the problem with the collected performance counters and train a model that predicts the performance counters' values when given a tuning configuration.
Generating Least-squares Regression Nonlinear Models
We provide two scripts in the 'profile-searcher/scripts-prep' folder in the GitHub KTT repository. The main script create_least_squares_models.R trains nonlinear models: it takes the tuning data, and for each performance counter, it generates a model that predicts its value based on the values of the tuning parameters. To increase the accuracy of such prediction, we divide the tuning space into subspaces based on the values of the binary tuning parameters, as we suspect these have a profound influence on the performance counters. Thus, we generate several models for each performance counter, each model applicable only for a given combination of values of the binary tuning parameters. An example of its usage is shown in Listing 1. We recommend R v3.4.4; no special CRAN libraries are necessary.

Listing 1. Generating nonlinear models for the GEMM benchmark on GeForce GTX 1070
Rscript ./create_least_squares_models.R 1070-gemm-reduced_output.csv \
    1070-gemm-reduced 4:19 2,3,19:62

It takes four arguments:

• [input file name], e.g. 1070-gemm-reduced_output.csv; it must follow the formatting of raw tuning data, as described above in the section Data Description;

• [prefix for output files names], e.g. 1070-gemm-reduced; this will be used to name the output files with models, with -model_[number].csv appended;

• [numbers of columns with tuning parameters in input file], in a format allowing to set individual columns and intervals of columns (in the format 'from:to'), e.g. 2,5:12, meaning columns 2 and 5 through 11; counting starts at 0;

• [numbers of columns with performance counters in input file], in the same format.

After parsing the script arguments and reading the input file, we code the tuning parameters' values, i.e., scale them to the range of <-1, 1>. This step is recommended in any regression model design, as models generally do not work well with the absolute values of the factors. Next, we select the values of tuning parameters that will determine the training data. In other words, we do not choose the data points (rows from the input file) for training randomly. We select a few values of the non-binary tuning parameters and then include all available combinations in the training dataset. We need to moderate the number of values to prevent an exponential increase in training data size, or a poor sampling of some part of the tuning space due to constraints.

The model-fitting function takes two arguments: the formula and the training data. The formula includes the factors, i.e., the coded tuning parameters, and arithmetic operations with them. To make the models nonlinear, we include multiplications of factors (to capture their interactions) and quadratic terms. The training data include the rows from the input file with the selected values of tuning parameters and the corresponding values of the given profiling counter.

The output of the script is multiple files named [output name]-model_[number].csv. The number of models corresponds to the number of combinations of values of the binary parameters. If a model cannot be created for a specific combination of binary parameters (e.g., there are no data due to constraints), the closest model (i.e., the one differing in the minimal number of values of binary tuning parameters) fills in and is printed in the output file. The format and contents of the model files are described in the above section Data Description.

The script generate_least_squares_models.py makes it quick and easy to generate models for all our raw tuning data. It requires python3, with the docopt library installed. It takes one argument, --benchmark. The option --benchmark GPU generates models for the benchmarks conv, coulomb, gemm-reduced, mtran and nbody for all GPUs. The option --benchmark GEMM generates models for the different input sizes of the benchmark gemm-reduced on GeForce GTX 1070. Users may need to modify the script to accommodate the names of the folders with data.
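To make the modelling idea concrete, the following minimal Python sketch fits one such model; the authoritative implementation is the R script described above, and the tuning-parameter and counter names used here ('TP_A', 'TP_B', 'dram_utilization') are hypothetical.

import numpy as np
import pandas as pd

df = pd.read_csv("1070-gemm-reduced_output.csv")

def code(col):
    # Scale a tuning parameter linearly to the range <-1, 1>.
    lo, hi = col.min(), col.max()
    return (2.0 * (col - lo) / (hi - lo) - 1.0).to_numpy()

a = code(df["TP_A"])
b = code(df["TP_B"])

# Design matrix with linear terms, an interaction term (capturing the
# interplay of the two parameters) and quadratic terms, as described above.
X = np.column_stack([np.ones(len(df)), a, b, a * b, a**2, b**2])
y = df["dram_utilization"].to_numpy()

coef, *_ = np.linalg.lstsq(X, y, rcond=None)  # least-squares fit
predicted = X @ coef                          # predicted counter values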
Generating Decision Trees

The script generate_decision_tree_model.py for decision-tree model preparation is stored in the 'profile-searcher/scripts-prep' folder in a KTT distribution. While generating the model, users have to supply the CSV file containing the explored tuning space, the columns containing the tuning parameters, and the columns containing the profiling counters together with the parallelism configuration (the "Global size" and "Local size" columns), in the same format as with the least-squares regression nonlinear models; see Listing 2.
Listing 2. How to call the script for generating a decision-tree model

python3 generate_decision_tree_model.py \
    -i 1070-gemm-reduced_output.csv -t 4:19 -c 2,3,19:62
The script builds models predicting performance counters using optimized decision trees. The performance counter prediction with decision trees is computationally more efficient than with the least-squares models; therefore, we recommend them as the default choice. The decision trees have high precision in densely sampled tuning spaces, but they are poor at extrapolation. Therefore, if only a smaller part of the tuning space is sampled, we recommend testing whether the least-squares method would bring better precision and faster tuning convergence.

The resulting decision tree is stored in a file with the suffix _DT.sav, and the list of hardware performance counters predicted by the script has the suffix _DT.sav.pc. For example, the model for the file 1070-gemm-reduced_output.csv is stored in 1070-gemm-reduced_output_DT.sav.

Execution of Simulated Tuning
The simulated tuning script simulated-profiling-searcher.py performs a search of the autotuning space on a pre-computed tuning space. It requires the auxiliary files base.py and mlKTTPredictor.py, distributed with KTT. It also requires python3, with the libraries docopt, numpy, pandas, pickle and sklearn installed.

Instead of real execution and profiling of autotuned kernels (obtaining their runtime and hardware performance counters), it reads the stored raw autotuning data (i.e., it just simulates their execution and profiling) and performs a search on them. The advantage of this approach is that it runs much faster than real tuning (as no compilation, execution or profiling is performed); therefore, the simulated tuning experiments can be repeated many times to get statistically relevant data. The simulated run also does not require the installation of KTT, or the GPU we are simulating autotuning for. The convergence of the search method is measured and can be compared in terms of the number of search steps (equal to the number of kernel executions performed by KTT in a real environment).

Listing 3 shows three examples of running the script. We want to tune the gemm-reduced benchmark. The models for predicting performance counters were trained on GeForce GTX 750, and we want to use them to guide the search on GeForce GTX 1070. The three commands differ in the model used. The first one does not predict anything; it only reads the performance counters' values from the provided raw tuning data file. The second one uses a decision tree, and the third one uses least-squares nonlinear models.
Listing 3. Examples of using simulated-profiling-searcher.py

python3 -W ignore ./simulated-profiling-searcher.py \
    -o 1070-gemm-reduced_output.csv --oc 6.1 --mp 15 --co 1920 \
    --cm 750-gemm-reduced_output.csv --ic 5.0 -p 1 -t 4:19 -c 2,3,19:62 \
    --compute_bound -e 1000 -i 1000

python3 -W ignore ./simulated-profiling-searcher.py \
    -o 1070-gemm-reduced_output.csv --oc 6.1 --mp 15 --co 1920 \
    --dt 750-gemm-reduced_output_DT.sav --ic 5.0 -p 1 \
    -t 4:19 -c 2,3,19:62 --compute_bound -e 1000 -i 1000

python3 -W ignore ./simulated-profiling-searcher.py \
    -o 1070-gemm-reduced_output.csv --oc 6.1 --mp 15 --co 1920 \
    --ls 750-gemm-reduced --ic 5.0 -p 1 -t 4:19 -c 2,3,19:62 \
    --compute_bound -e 1000 -i 1000
The script takes multiple arguments:

• -o [raw tuning data file]: the raw tuning data file, following the format described in the section Data Description;

• --oc [compute capability of the GPU used to produce the raw tuning data]: e.g. 6.1, if the raw tuning data came from GeForce GTX 1070;

• --mp [number of multiprocessors on that GPU]: e.g. 15, if the raw tuning data came from GeForce GTX 1070;

• --co [number of CUDA cores on that GPU]: e.g. 1920, if the raw tuning data came from GeForce GTX 1070;

• one of the following:

  – --cm [raw tuning data file]: with this option, no prediction of the values of performance counters is computed; their actual values are read from the file with the given raw tuning data;

  – --dt [decision tree model file]: with this option, a decision tree model is employed to predict the values of performance counters;

  – --ls [prefix for least squares model files]: with this option, least-squares nonlinear models are employed to predict the values of performance counters;

• --ic [compute capability of the GPU of the training data for the model]: e.g. 5.0, if the model was trained with data from GeForce GTX 750;

• -p [column with computation time]: always 1 in the provided raw tuning data;

• -t [columns with tuning parameters];

• -c [columns with performance counters];

• -e [number of experiments]: sets how many times the experiment is repeated, to get more stable results in the case of randomized searchers;

• -i [number of iterations]: sets how many tuning iterations (i.e., search steps) are performed per experiment;

• --compute_bound or --memory_bound: e.g. --compute_bound, as that is the character of the gemm-reduced problem.

Note that the values of several arguments, such as the compute capabilities of the GPUs, the numbers of multiprocessors or CUDA cores, and the column indexes for computation time, tuning parameters and profiling counters, are available for our raw tuning data in the script autobench.py, described later in this section. For details on the algorithm of the profile-based search, please see [1].

The results of the analysis are stored as CSV files of the following format. The first line contains a header with the names of the columns. Each following row presents an iteration of the searcher (i.e., the exploration of the next tuning configuration, requiring its profiling). The first column contains the iteration number. The second column contains the average runtime, with the standard deviation, of the best kernel known in this iteration when the random searcher is utilized. The third column contains the average runtime, with the standard deviation, of the best kernel known in this iteration when the profile-based searcher is utilized.

The script autobench.py makes it easy to run simulated tuning for all our raw tuning data. It takes two arguments, --benchmark and --method. The option --benchmark GPU runs simulated tuning for the benchmarks conv, coulomb, gemm-reduced, mtran and nbody for all GPUs. The option --benchmark GEMM runs simulated tuning for the different input sizes of the benchmark gemm-reduced on GeForce GTX 1070. The option --method has three possible arguments, Exact, DecisionTree or LeastSquares, to denote the model used for predicting the values of performance counters. Users may need to modify the script to accommodate the names of the folders with data. Moreover, the script can be used as a source of information on the possible values of several command-line arguments of simulated-profiling-searcher.py, such as the compute capabilities of different GPUs, the numbers of their multiprocessors and CUDA cores, and the indexes of the columns for computation time, tuning parameters and performance counters in the raw tuning data we provide.

We used the simulated tuning to analyze the convergence speed of the profile-based searcher proposed in [1] and of the random search. During the analysis, the autotuning is performed in the defined number of iterations. In each iteration of the searching process, the runtime of the best kernel found is logged. The autotuning is performed multiple times, so we obtain the average speed of the best kernel for each iteration over multiple autotuning executions. Other search methods, or modifications based on our profile-based searcher, might be easily added to the scripts and compared consistently.
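The following minimal sketch illustrates the simulation principle for the random searcher; it is not the provided script (which additionally implements the profile-based searcher), and it assumes a raw tuning data file at hand.

import numpy as np
import pandas as pd

df = pd.read_csv("1070-gemm-reduced_output.csv")
time_col = next(c for c in df.columns if c.startswith("Computation duration"))
runtimes = df[time_col].to_numpy()

def random_search(runtimes, iterations=100, experiments=1000, seed=0):
    rng = np.random.default_rng(seed)
    best = np.empty((experiments, iterations))
    for e in range(experiments):
        # Sample configurations without replacement; 'iterations' must not
        # exceed the number of configurations in the tuning space.
        picks = rng.choice(len(runtimes), size=iterations, replace=False)
        best[e] = np.minimum.accumulate(runtimes[picks])
    # Average best-so-far runtime per iteration over all experiments.
    return best.mean(axis=0)

convergence = random_search(runtimes)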
Execution of Real-time Tuning

Real-time tuning performs a search of the autotuning space without a pre-computed tuning space. The autotuned kernels are actually compiled, executed and profiled during the search. This is far more demanding than the simulated tuning described above, but it makes it possible to measure the actual time per tuning search iteration. The convergence of the search method is measured and can be compared in terms of tuning time.

Real-time tuning can be executed by running a compiled benchmark. For benchmarks bundled with KTT, the preparation includes the following steps:

• KTT needs to be compiled with profiling, i.e., premake5 --profiling=cupti gmake needs to be run before building it. In the case of older GPU architectures, use --profiling=cupti-legacy instead.

• In the code of the benchmark (in the cpp file), the preprocessor macro EXHAUSTIVE_SEARCH has to be set to 0.

• The random search is used by default. To test the profile-based searcher [1], the macro USE_PROFILE_SEARCHER has to be set to 1 in the code of the benchmark (in the cpp file).

• The time for autotuning is restricted to a certain value set by the macro TUNE_SEC. This time can be altered; for example, TUNE_SEC 60 performs autotuning for 60 seconds.

• The prediction models needed for the profile-based searcher have to be in the 'KTT/profile-searcher/models' folder.

Listing 4 shows examples of running a single real-time tuning for each benchmark and saving the log file.
Listing 4. Running a single execution of real-time tuning on every benchmark

cd KTT/build/x86_64_Release/
./conv_cuda > conv_experiment_1.log
./coulomb_sum_3d_cuda > coulomb_experiment_1.log
./gemm_cuda > gemm_experiment_1.log
./mtran_cuda > mtran_experiment_1.log
./nbody_cuda > nbody_experiment_1.log
The benchmark executable has two arguments; however, both of them have default values:

• [platform index]: default value 0; cannot be changed when using CUDA;

• [device index]: default value 0; users may need to change this if multiple GPUs are available.

The input size can be modified in the source code of the given benchmark, in the main function. Again, reasonable default values are set. In our evaluation in [1], we set the input sizes of the benchmarks according to Table 4.

Proper evaluation of the searcher's convergence in terms of tuning time requires multiple runs of real-time tuning. This can be easily done by executing the benchmark multiple times and generating a log file for each run. For easier processing of the results in multiple log files, we provide the script histogram.py. See an example of its usage in Listing 5.

Listing 5. Processing multiple runs of real-time tuning

python3 ./histogram.py -s gemm-experiments -t 300
It takes two arguments:

• -s [folder with log files from real-time tuning runs];

• -t [time]: in seconds; denotes the maximum running time that is analyzed.

The results of the histogram.py script are stored as CSV files of the following format. The first line is a header containing the names of the columns. Each following row contains the measurement for one second of the autotuning process. A missing row indicates that no new data were available in that particular second of the tuning process (this may happen in the initial part of the tuning if the first profiled kernel runs for a long time). The rows contain the following columns:

• the time (in seconds) from the beginning of the autotuning;

• the average runtime of the best kernel known at the corresponding time;

• the standard deviation of the runtime;

• the minimal runtime;

• the maximal runtime.

The data are collected from all available executions of the autotuning (all log files in the folder passed by the argument -s).
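For instance, a few lines of Python suffice to read the produced CSV and report the convergence at the end of the analyzed window; the output file name used here is hypothetical, and the column layout is the one just described.

import pandas as pd

hist = pd.read_csv("gemm-experiments-histogram.csv")
last = hist.iloc[-1]  # columns: time, average, std. deviation, min, max
print(f"after {last.iloc[0]} s: best kernel runtime {last.iloc[1]} "
      f"(std {last.iloc[2]}, min {last.iloc[3]}, max {last.iloc[4]})")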
Acknowledgments
The work was supported by the European Regional Development Fund project "CERIT Scientific Cloud" (No. CZ.02.1.01/0.0/0.0/16_013/0001802).

Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships which have, or could be perceived to have, influenced the work reported in this article.
References

[1] J. Filipovič, J. Hozzová, A. Nezarat, J. Oľha, F. Petrovič, Using hardware performance counters to speed up autotuning convergence on GPUs, Future Generation Computer Systems, In Press (2021).
[2] F. Petrovič, D. Střelák, J. Hozzová, J. Oľha, R. Trembecký, S. Benkner, J. Filipovič, A benchmark set of highly-efficient CUDA and OpenCL kernels and its dynamic autotuning with Kernel Tuning Toolkit, Future Generation Computer Systems 108 (2020) 161-177.
[3] C. Nugteren, V. Codreanu, CLTune: A generic auto-tuner for OpenCL kernels, in: Proceedings of the IEEE 9th International Symposium on Embedded Multicore/Many-core Systems-on-Chip (MCSoC), 2015.
[4] J. Filipovič, F. Petrovič, S. Benkner, Autotuning of OpenCL kernels with global optimizations, in: Proceedings of the 1st Workshop on AutotuniNg and aDaptivity AppRoaches for Energy Efficient HPC Systems (ANDARE '17), 2017.
[5] C. Nugteren, CLBlast: A tuned OpenCL BLAS library, in: Proceedings of the International Workshop on OpenCL, IWOCL '18, ACM, 2018, pp. 5:1-5:10. doi:10.1145/3204919.3204924