Performance portability through machine learning guided kernel selection in SYCL libraries
John Lawson
Codeplay Software Ltd.
Abstract
Automatically tuning parallel compute kernels allows libraries and frameworks to achieve performance on a wide range of hardware; however, these techniques are typically focused on finding optimal kernel parameters for particular input sizes and parameters. General purpose compute libraries must be able to cater to all inputs and parameters provided by a user, and so these techniques are of limited use. Additionally, parallel programming frameworks such as SYCL require that the kernels be deployed in a binary format embedded within the library. As such it is impractical to deploy a large number of possible kernel configurations without inflating the library size.

Machine learning methods can be used to mitigate against both of these problems and provide performance for general purpose routines with a limited number of kernel configurations. We show that unsupervised clustering methods can be used to select a subset of the possible kernels that should be deployed, and that simple classification methods can be trained to select from these kernels at runtime to give good performance. As these techniques are fully automated, relying only on benchmark data, the tuning process for new hardware or problems does not require any developer effort or expertise.
Keywords:
Auto-tuning, SYCL, GPGPU, Machine learning, Performance portability
1. Introduction
Auto-tuning has been widely studied as a technique to allow libraries to obtain portable performance across a range of devices by utilising parameterized kernels and selecting the right parameters to match the compute capabilities of the different devices.

For frameworks like OpenCL that provide their kernels as source code this works especially well. The source code can be configured using the preprocessor to handle any number of possible parameter configurations. Other parallel programming frameworks like CUDA and SYCL provide the kernels in a compiled binary format, and so each set of parameters requires a new binary blob containing the kernel compiled with those parameters. Supporting many different kernel instantiations in these libraries adds complexity and a cost in terms of library size and build times.

Standard auto-tuning techniques sample the kernel parameter space in order to determine the set of parameters that give the best performance for a given problem. This process aims to provide the absolute best performance for that particular set of input sizes and problem parameters, and so is especially effective when these inputs and problem parameters are constant. On the other hand the auto-tuning must be done every time the inputs or parameters change, which is typically a costly process.
Email address: [email protected] (John Lawson)
As a result of this it is difficult to use auto-tuning to provide general purpose libraries that can cater to all possible inputs. We look at using unsupervised machine learning techniques to explore the space of kernel parameters and select a subset of kernels that can be deployed in a library to provide close to optimal performance on a wide range of possible inputs. These clustering techniques allow the library to achieve over 90% of the optimal performance while limiting the library to include as few as four kernels.

We also consider how well machine learning classification methods can select from these kernels at runtime. Decision trees are an effective way to do this, preserving a large proportion of the possible performance while being easy to integrate into the library.

When combined, these automated approaches are an effective way to extract performance from parameterized kernels suited for a wide range of possible inputs, and this performance can be achieved with very little developer effort. These approaches allow a simple matrix multiplication kernel to provide performance similar to or even much better than hand optimized BLAS implementations on a range of hardware. We demonstrate this by comparing the inference time of VGG16, a popular image classification network implemented using SYCL-DNN, an accelerated neural network library, when using different matrix multiplication routines. The tuned simple kernel is competitive on desktop GPUs and performs better than optimized BLAS libraries on integrated GPUs and mobile GPUs.

2. Background and related work

OpenCL [1] is a heterogeneous programming framework developed originally by Apple and now maintained by the Khronos Group. It is an open standard designed to provide a cross platform way to program a wide range of hardware from GPUs to FPGAs. OpenCL allows developers to write compute kernels in a subset of C, which are embedded within applications and libraries as strings of source code.
This source code is then just-in-time (JIT) compiled to match the target device at runtime. By using JIT compilation, OpenCL allows developers to use the preprocessor to inject constants and types into generic kernels. Different versions of the same kernels can be compiled multiple times to match the different inputs and sizes at runtime while using the same source code. As the same source code can be used for all the different parameter values, there is no cost to using this technique beyond the additional compilation time to compile each kernel.

SYCL [2] is a more recent open standard from the Khronos Group, introduced in 2014 aiming to remove the boilerplate and complexity of lower level heterogeneous programming frameworks like OpenCL. Using SYCL a developer can write compute kernels using standard C++, as well as make use of the strong C++ type system to track data dependencies and manage data movement between host and device.

OpenCL requires hardware vendors to package a C compiler with their device drivers, but to support SYCL it would be more challenging to include a full C++ compiler and JIT compile heavily templated C++ kernels. Instead, SYCL adopts a two stage compilation approach, where the kernels are initially compiled to an intermediate representation (IR) that is bundled with the library or application binary. This IR blob is then passed to the OpenCL JIT compiler at runtime, significantly reducing the amount of work required to compile the kernels at runtime.

The downside of shipping kernels in a binary format is that these now include the kernel parameters, and so a different binary blob is required for each instantiation of the kernel.

There are many existing OpenCL and SYCL accelerated compute libraries, including the BLAS implementations clBLAS [3], CLBlast [4] and SYCL-BLAS [5, 6]. Each of these libraries is tuned for their target hardware to some extent.
These libraries either provide a set of hard-coded kernel parameters for given inputs, chosen by hand to try to give good performance, or take a more automated approach with benchmark scripts that generate these sets of parameters, which can then be compiled into the library. These automated approaches currently use heuristics and limited numbers of kernel benchmarks to try to establish which parameters to use.
There are many auto-tuning techniques that have been widely studied. General purpose tuning frameworks such as clTune [7] and Kernel Tuner [8] provide easy to use tuning for compute kernels, tuning OpenCL, CUDA and other kernels. The techniques used by these frameworks combine kernel benchmarks, to measure the performance of a given set of parameters, with a parameter search algorithm to selectively sample from the parameter space while maximising performance.

Despite the sophistication of these parameter search algorithms, such auto-tuning systems can be expensive in terms of power and time usage, and must be run for each required set of inputs. This can be partially mitigated using machine learning to learn a model of the kernel performance and using this model to predict reasonable parameters to start the auto-tuning search. Techniques discussed in [9] and in [10] replace the parameter search algorithms with machine learning based approaches. A random sample of kernel configurations is benchmarked, and these timings used to train a model that predicts the timings of all other kernel configurations, allowing the optimal configuration to be directly chosen from the predicted times.

Other uses of machine learning in automated kernel optimization include predicting whether an operation would be computed faster on CPU or GPU [11, 12], and whether a kernel would perform better when manually caching data in local memory [13].

Auto-tuning has been used to provide portable performance on different hardware for a variety of different computational tasks, including convolutions [14], matrix multiplication [15, 4], FFTs [16] and stencils [17, 18].

A different approach to auto-tuning is to explore the different kernel parameters during the end program runtime. This dynamic approach of auto-tuning allows the best available configuration to be found if the same problem is computed multiple times.
This is used in the TensorFlow [19] and MXNet [20] machine learning frameworks with the cuDNN [21] launcher options. While this does not provide as fine grained control as the kernel based auto-tuning techniques, it does allow coarse grained decisions about the best algorithm or approach to take for given problems on fixed hardware.
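The surrogate-model approach of [9] and [10] discussed above can be sketched as follows. Everything in this example is a synthetic stand-in: the parameter space, the timing function and the model choice are ours, used only to illustrate the sample-fit-predict loop, not a real kernel tuner.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)

# Hypothetical kernel parameter space: all (row, acc, col) tile sizes.
configs = np.array([(r, a, c) for r in (1, 2, 4, 8)
                              for a in (1, 2, 4, 8)
                              for c in (1, 2, 4, 8)], dtype=float)

def measured_time(cfg):
    """Synthetic stand-in for benchmarking one kernel configuration."""
    r, a, c = cfg
    return 1.0 / (r * c) + 0.05 * a + rng.normal(0.0, 0.01)

# 1. Benchmark only a random sample of configurations.
sample_idx = rng.choice(len(configs), size=16, replace=False)
times = np.array([measured_time(configs[i]) for i in sample_idx])

# 2. Fit a surrogate model on the sampled timings.
model = RandomForestRegressor(n_estimators=50, random_state=0)
model.fit(configs[sample_idx], times)

# 3. Predict timings for *all* configurations; pick the fastest predicted.
predicted = model.predict(configs)
best = configs[np.argmin(predicted)]
```

The optimal configuration can then either be used directly, or serve as the starting point for a conventional parameter search.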
3. A matrix multiply case study
Matrix multiplications are an integral part of modern deep learning and many other domains, so having accelerated routines optimized for particular hardware gives a significant impact on the performance of these computations. The kernels that calculate a matrix multiplication have been the target of many previous auto-tuning techniques as the kernels can easily be written to make use of many parameters. These kernels are less complicated compared to other stencil or convolutional kernels, while having enough scope for loop transformations, tiling and caching memory accesses that they are good targets for tuning.

In [22] we introduced a matrix multiply case study using the parameterized kernels provided by the SYCL-DNN [23] library. This paper continues the study of auto-tuning these kernels, expanding the number of benchmarks and the devices targeted by the tuning techniques.

Each work item in this matrix multiplication kernel computes a small tile of the output. For integers
R, A and C, it loads an R × A tile from the left hand input and an A × C tile from the right hand input, which are accumulated into an R × C output tile. These tile sizes are compile time constants that also correspond to the vector sizes used to load the values from memory, so the possible values are 1, 2, 4 and 8. These three parameters give 64 different kernel configurations.

In addition to the compile time kernel constants we considered the effects of different work group sizes on performance, using a combination of 1, 8, 16, 32, 64 and 128. As the total work group size for a kernel is limited by the device drivers, we only used the following pairings: (1, 64), (1, 128), (8, 8), (8, 16), (8, 32), (16, 8), (16, 16), (32, 8), (64, 1) and (128, 1); giving a total of 640 possible configurations to select from.

To measure the effects of the different kernel parameters and work group sizes we ran a number of benchmarks on two platforms. With only 640 possible configurations it is feasible to test the performance of every configuration. This allows us to evaluate whether the kernel selection techniques manage to choose the best performing kernel and avoids any confounding factors that may arise when combining these techniques with standard kernel auto-tuning techniques. As auto-tuning will typically try to selectively search the kernel parameter space it will only end up sampling the performance of some kernel configurations and so would immediately discount some kernels from being chosen.

Fully connected and convolutional layers in machine learning models can be computed using matrix multiplications. The SYCL-DNN library is designed to provide accelerated routines for machine learning models, so matrix sizes derived in this way are representative of the typical workloads for the library. The benchmarks use the matrix sizes from three popular neural networks: VGG [24], ResNet [25] and MobileNet [26].
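As a sanity check, the configuration space described above can be enumerated directly; the work-group pairing list is copied from the text:

```python
from itertools import product

# Tile sizes double as vector load widths: 1, 2, 4 or 8 for each of R, A, C.
tile_sizes = list(product((1, 2, 4, 8), repeat=3))  # 64 tile shapes

# Work-group pairings permitted by the device drivers.
work_groups = [(1, 64), (1, 128), (8, 8), (8, 16), (8, 32),
               (16, 8), (16, 16), (32, 8), (64, 1), (128, 1)]

configurations = [(t, wg) for t in tile_sizes for wg in work_groups]
print(len(configurations))  # 640
```

The 640 total matches the figure quoted above; it is each of these configurations that is benchmarked over the matrix sizes drawn from the three networks.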
Overall these gave 300 different sets of sizes for the input matrices of the computations.

The benchmarking framework used to collect the data ran a small number of warmup iterations to ensure the devices were running at optimal clock speeds and that the kernels were compiled. The measurement collected was the total time for a number of iterations of kernel execution, giving an overall mean time for each kernel execution. The actual number of iterations varied depending on the
time of execution, aiming for each benchmark to run for around 1 second in total. Between each benchmark run the framework paused for a short amount of time to help reduce any thermal throttling, and device temperatures were monitored during the benchmarking process to ensure there was no throttling.

Figure 1: The performance of all different kernel configurations for three sets of input sizes on the AMD R9 Nano GPU, with varying matrix sizes from square to rectangular. Multiplying small reasonably square matrices performs best overall and favors large tile sizes, while tall skinny matrices perform poorly in all configurations.

The devices used to run the benchmarks were:

• An AMD R9 Nano GPU (driver v2482.3).
• An Intel i7-6700K CPU (driver v18.1.0.0920).

We used SYCL on top of OpenCL to target these devices, providing the kernels as SPIR.
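A simplified host-side sketch of this benchmarking scheme is below. The real framework times SYCL kernel executions and pauses between runs to avoid thermal throttling; this illustrative version only captures the warmup-calibrate-measure structure described above.

```python
import time

def benchmark(kernel, *, warmup=5, target_seconds=1.0):
    """Mean execution time of `kernel`: warm up first (stable clocks,
    kernels compiled), then run enough timed iterations that the total
    measurement lasts roughly `target_seconds`."""
    for _ in range(warmup):
        kernel()

    # Calibrate the iteration count from one timed execution.
    start = time.perf_counter()
    kernel()
    single = time.perf_counter() - start
    iterations = max(1, int(target_seconds / max(single, 1e-9)))

    # The actual measurement: total time over many iterations.
    start = time.perf_counter()
    for _ in range(iterations):
        kernel()
    total = time.perf_counter() - start
    return total / iterations
```

Averaging over a ~1 second window smooths out launch overhead and timer resolution, at the cost of a longer overall tuning run.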
The matrix sizes in the dataset vary, with some being very large and others small, some fairly square with a large batch size and others very tall and skinny. These different sizes provide different performance characteristics for the kernels on the hardware. For example the tall and skinny matrices lead to very few threads being used in the multiplication, and so for large compute devices like the AMD GPU a lot of the compute capacity goes unused. Even using auto-tuning to select the best kernel will not solve this problem, and really a separate kernel should be used that is designed to utilize all the hardware for these sorts of matrix inputs. This is beyond the scope of the paper, but should such a kernel be available then the type of kernel could be considered as another parameter that has to be selected by a tuning system.

Figure 2: The number of times a configuration of kernel parameters achieves optimal performance in the dataset. For the AMD GPU, one configuration is best in 39 cases, but 80 distinct configurations are best in at least one case. For the Intel CPU the top three configurations are best in 35, 28 and 25 cases respectively and 68 are best in at least one case.

As an example of this, on the R9 Nano the best performing configuration (tiles (8, 4, 4), work-group (16, 16) for m=512, k=784, n=512, batch=16) achieves 3160 gigaflops per second, while the worst configuration (tiles (1, 8, 1), work-group (8, 8) for m=32, k=12321, n=27, batch=1) only achieves 13 Gflops/sec. The best configurations for the small cases are the ones that use the most
threads and so achieve the highest utilisation of the GPU, while the best configurations for large problems are the ones that reuse the most data without spilling registers. As the numbers of threads and numbers of registers are device specific, these are the things that an automated kernel deployment system would have to implicitly learn from the dataset.

Figure 1 shows the performance for three different sets of input matrix sizes. The more square matrices (m=512, k=784, n=512) allowed the kernels to perform best, but optimal performance was only achieved by a very small number of kernel configurations. In this case, of the 640 possible configurations only 55 achieved over 2 teraflops/sec and only 7 of those achieved over 3 teraflops/sec. This highlights the importance of tuning the kernel parameters and ensuring that the best parameters are available in a library.

The second results in Figure 1, from a more rectangular set of input matrices (m=512, k=4608, n=784), have three kernel configurations that achieve over 2 teraflops/sec. All three of these kernel configurations achieve over 3 teraflops/sec with the square input sizes, but the best performing configuration for the square inputs achieves less than 1.4 teraflops/sec for the rectangular input sizes. The third set of results correspond to an input set with a very large number of elements to accumulate, and as discussed above the kernel used is not optimized for these cases and so performs poorly overall.

The challenge faced by an automated kernel selection program is that many different configurations obtain the best performance for different matrix sizes.
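The throughput figures above follow from the standard flop count for a batched multiply, 2·m·k·n·batch operations (one multiply and one add per accumulated element). For the best performing case this works out as follows; the ~2.1 ms per launch is derived arithmetic from the quoted 3160 Gflops/sec, not a separate measurement:

```python
def matmul_gflops(m, k, n, batch=1):
    """Gflop count for a batched matrix multiply, counting one multiply
    and one add per accumulated element."""
    return 2 * m * k * n * batch / 1e9

work = matmul_gflops(512, 784, 512, batch=16)
print(round(work, 2))                 # 6.58 Gflop of work

# At the quoted 3160 Gflops/sec, one launch takes about 2.08 ms.
print(round(work / 3160 * 1e3, 2))    # milliseconds per launch
```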
Figure 2 shows that while there are a small number of configurations that perform best in a large number of cases, there is a long tail where many other configurations also perform best in at least one of the benchmarks. This long tail illustrates the problem with pruning the number of configurations required to deploy within a library, and suggests that any such pruning will result in some loss of performance. The goal of this paper is to determine whether an automated solution can minimize this loss in performance.

The dataset and the corresponding code is available online [27]. The machine learning routines were provided by scikit-learn [28].

Figure 3: The percentage of the variance of the dataset accounted for by each PCA component. For the AMD GPU over 80% of the variance is accounted for in the 4 main components, 90% is accounted for in 7 components, and 95% in 14. For the Intel CPU 4 components account for 80% of the variance, 6 components for 90% and 11 components for 95%.

As discussed in Section 2.1 a SYCL library cannot deploy an unlimited number of kernels, as they are embedded within the library as binary blobs. As such the kernels that should be deployed must be carefully selected to provide as much performance as possible. The number of kernels to deploy could be determined through trial and error by investigating the achievable performance of different numbers of kernels. A more tractable approach would be to explore the variance within the dataset and use that to estimate how many kernels may encapsulate that variance.

Principal component analysis (PCA) [29, 30] finds a new coordinate system for the dataset that concentrates the variance into as few dimensions as possible. In this way these principal dimensions contain the most distinguishing information about the dataset.
Figure 3 shows the amount of total variance in the dataset that is accounted for by each of the components identified by PCA. This highlights that the data is fairly structured and that the majority of the variance is encapsulated within a small number of components.

As PCA shows that most of the dataset's variance can be encapsulated in less than 15 components, we study how much performance can be encapsulated when providing at most 15 kernels. We compare the performance that is achievable when the number of kernels that would be deployed in a library varies between 4 and 15.
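The explained-variance calculation behind Figure 3 can be sketched with PCA via a singular value decomposition. The data below is a synthetic stand-in built from a few latent "performance profiles" so that, like the real dataset, the variance concentrates in a handful of components:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic benchmark matrix: one row per matrix-size benchmark,
# one column per kernel configuration, generated from 5 latent profiles.
profiles = rng.random((5, 640))
weights = rng.random((300, 5))
data = weights @ profiles + rng.normal(0.0, 0.01, (300, 640))

# PCA via SVD on the centred data.
centred = data - data.mean(axis=0)
_, s, _ = np.linalg.svd(centred, full_matrices=False)
explained = s**2 / np.sum(s**2)       # variance ratio per component

# Number of components needed to reach 95% of the total variance.
cumulative = np.cumsum(explained)
n_components = int(np.searchsorted(cumulative, 0.95)) + 1
print(n_components)
```

Because the synthetic data has only 5 latent profiles plus small noise, a handful of components suffices here; on the real benchmark data the equivalent counts are the 11-14 components reported in Figure 3.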
For each set of matrix sizes, the benchmarks measured the performance as gigaflops per second for each kernel. This gives 640 floating point values describing the performance, ranging from 0 to the maximum Gflops/sec of the device.

When comparing the performance of kernels for fixed matrix sizes, it is helpful to consider the comparative performance of the kernels instead of the raw flops/sec achieved. By normalizing the data to only show the comparative performance, the data is easier for an automated system to understand. Such a normalization technique should map the performance to a value between 0 and 1, with the best performing kernels valued at or close to 1, while poor performing kernels have a value closer to 0.

In the original work, the only normalization technique considered was to scale the performance results relative to the performance of the kernel that performed best. The normalized value is obtained by dividing the achieved performance by the maximal performance for a fixed input. This provides a uniform mapping that preserves the relative performance between all kernels.

As the kernel selection process should infer more from the better performing kernels than the worst performing kernels, and hopefully never tries to select kernels that give mediocre performance, we can normalize the data to only preserve the kernels that perform well. We study three different approaches of doing this.

The first approach is to use a raw cutoff point, so that all results under a certain threshold are clamped to 0. In the results below we consider a cutoff value at 90% of the peak performance, so all results that obtain less than 90% of the optimal performance for each set of inputs are set to 0. This introduces sparsity in the data but does not change any non-zero values, so they range between 0.9 and 1.

An extension of this is to rescale the normalized data after clamping the poorly performing kernels.
This ensures that the values make full use of the 0 to 1 range, but may encourage the models to discard good performing kernels that it thinks actually perform poorly. In the discussion below we refer to this as the standard cutoff normalization technique (as opposed to the raw cutoff).

A final approach studied is to use a modified sigmoid function to map the scaled values, with many of the less well performing kernels mapped to 0. The sigmoid function f(x) = (1 + exp(50 ∗ (0.85 − x)))^−1 was constructed to map 85% performance to 0.5, with all values less than 80% mapped to less than 0.1.

Figure 4 shows the effects these normalization techniques have on the best performing set of inputs for the AMD GPU, with the raw performance shown in Figure 1.

Figure 4: Comparison of different data normalization techniques for the best performing set of input sizes for the AMD GPU.
As the normalization techniques all clamp low performing kernel configurations to zero, only the configurations achieving over 75% of the performance of the best configuration are shown.
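A minimal sketch of the four normalization schemes, written as plain functions over a list of per-kernel Gflops/sec figures. The function names are ours, not from the library; the sigmoid constant 0.85 is the midpoint stated above.

```python
import math

def scale(perf):
    """Divide by the best result so values lie in (0, 1]."""
    best = max(perf)
    return [p / best for p in perf]

def raw_cutoff(perf, threshold=0.9):
    """Scale, then clamp anything under the threshold to 0; survivors
    keep their value, so non-zero entries lie in [threshold, 1]."""
    return [p if p >= threshold else 0.0 for p in scale(perf)]

def cutoff(perf, threshold=0.9):
    """As raw_cutoff, but rescale survivors onto the full 0-1 range."""
    return [(p - threshold) / (1 - threshold) if p >= threshold else 0.0
            for p in scale(perf)]

def sigmoid(perf):
    """Squashed sigmoid f(x) = 1 / (1 + exp(50 * (0.85 - x))), so that
    85% of peak maps to 0.5 and under 80% maps below 0.1."""
    return [1.0 / (1.0 + math.exp(50 * (0.85 - p))) for p in scale(perf)]
```

For example, for perf = [1000, 950, 800, 100], scale gives [1.0, 0.95, 0.8, 0.1], raw_cutoff keeps only the first two values, cutoff additionally rescales those survivors onto [0, 1], and sigmoid pushes everything below 80% of peak towards 0.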
4. Kernel selection
The technique in this paper to deploy kernels in SYCL libraries is made up of two steps. First the kernel configurations to deploy are selected, and then a simple model is constructed to choose which of these configurations to use at runtime for a given problem. As SYCL kernels are embedded into the library as binaries it is impractical to include a large number of kernels. In order to balance performance and binary size the number of kernels must be pruned to those that give the best performance on a range of different problems.

The initial selection of kernels is done using unsupervised clustering of the dataset. For a given set of matrix sizes the dataset provides performance information for each of the 640 kernel configurations. This performance information can be represented as a point in 640-dimensional space, though as the raw times vary between matrix sizes it is useful to normalize these coordinates.

Matrix sizes that have similar performance characteristics will naturally end up with similar coordinates, and so clustering techniques can be used to group these together. By considering these clusters of similarly performing matrix sizes we can extract which kernels give the best performance.
There are many unsupervised machine learning clustering techniques available which try to extract meaning directly from the data. These each have different behaviors and consider different aspects of the data, so may extract widely varying sets of kernels.

k-means clustering

A relatively simple clustering method is k-means clustering, which is an iterative method to find k centroids that minimize the distance from each point in the dataset to its closest centroid. This method is effective when the clusters have shapes that are close to the unit ball in the coordinate space, however if the cluster shapes are less regular or intertwined the method will struggle to separate the clusters.

PCA and k-means clustering

To help get around this, the coordinate space of the dataset can be transformed to help separate the clusters. One approach to do this is using Principal Component Analysis to reduce the dimensionality of the dataset and concentrate the variance of the dataset by making use of the full range of values in each of the new dimensions, then using k-means clustering on this transformed data.

Spectral clustering

Another similar approach is to use a spectral transformation before using k-means clustering. A similarity graph of the coordinates in the dataset can be represented as an adjacency matrix. The eigenvectors of the Laplacian of this matrix provide new coordinates that can be clustered using k-means.

Density based clustering

Density based methods can also be used to cluster data, which use the density of the data to establish the boundaries between clusters. HDBScan [31, 32] is an example of such a clustering method that uses a hierarchical tree structure to construct the clusters and provide better estimates of outlying data.

Unlike the other clustering methods, HDBScan does not provide a parameter for the number of target clusters, rather providing however many clusters it finds based on its other hyperparameters.
In order to limit the number of clusters, we compute the number of clusters produced for a sweep of the hyperparameters, and in the following use whichever values gave the correct number of clusters.
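The cluster-then-select step can be sketched with scikit-learn's k-means. Random data stands in for the normalized benchmark matrix here; in the real pipeline each row would be one benchmark's normalized performance vector:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Placeholder for the normalized dataset: one row per matrix-size
# benchmark, one column per kernel configuration.
n_benchmarks, n_kernels, n_deploy = 300, 640, 8
data = rng.random((n_benchmarks, n_kernels))

# Group benchmarks whose performance profiles look alike.
km = KMeans(n_clusters=n_deploy, n_init=10, random_state=0).fit(data)

# Each centroid is a representative performance vector; the kernel
# performing best for the centroid is the one deployed for that cluster.
selected = sorted({int(np.argmax(c)) for c in km.cluster_centers_})
print(len(selected))  # at most n_deploy kernel configurations
```

Two centroids can share a best kernel, so the deployed set may be smaller than the cluster count; for the label-only methods the text describes, the centroid is replaced by the geometric mean of the cluster's members.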
Decision trees

While not a clustering method, decision trees can be used to choose a subset of a dataset by artificially limiting the number of leaf nodes in the tree. A decision tree can be trained as a regression solver that maps the input matrix sizes to the vector of performance data. Unlike the clustering methods, this takes into account the matrix sizes rather than just the performance data. Each leaf node then ends up being a performance vector which is an approximate representative of the performance vector for all input sizes that end up at that node in the tree.
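This leaf-limited regression can be sketched with scikit-learn's DecisionTreeRegressor; the sizes and performance matrix below are random placeholders for the benchmark data:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)

# Inputs: hypothetical (m, k, n, batch) matrix sizes.
# Outputs: the performance vector over all kernel configurations.
sizes = rng.integers(1, 4096, size=(300, 4)).astype(float)
perf = rng.random((300, 640))

# Capping the leaf count caps the number of distinct performance
# vectors the tree can predict, and hence the number of kernels chosen.
tree = DecisionTreeRegressor(max_leaf_nodes=8, random_state=0)
tree.fit(sizes, perf)

# Each leaf predicts one mean performance vector; its argmax is the
# kernel deployed for all input sizes routed to that leaf.
leaf_vectors = np.unique(tree.predict(sizes), axis=0)
selected = sorted({int(np.argmax(v)) for v in leaf_vectors})
print(len(selected))  # at most max_leaf_nodes kernels
```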
To compare the effectiveness of clustering methods for selecting kernel configurations to deploy in a library, we explored their outcomes given the benchmark dataset. As a baseline we used a selection method of choosing the kernels that gave best performance by count. This Top-N method is a formulation of the methods used when previously selecting the kernels manually, and serves as a useful baseline to see how differently the more advanced methods perform.

The clustering methods provide either representatives of the clusters, such as the centroids of the k-means clusters, or just the cluster labels for each of the data entries. When there are representatives of the clusters, these can be used to select an optimal kernel by looking at which kernel configuration performs best for the representative. When the full cluster is provided, the optimal kernel is computed by taking the geometric mean of all elements in the cluster and choosing the best performing configuration of this mean set of values.

The dataset was split into training and test subsets, allowing a comparison of how well the techniques generalize to previously unseen matrix sizes. Each proposed technique used the training dataset to select a fixed number of kernel configurations, and the test dataset was used to evaluate what percentage of the optimal performance could be achieved by only considering those selected kernels.

The optimal performance of the test data is given by the benchmark data and normalized to between 0 and 1. A geometric mean of each value for the best performing kernel out of the selection was computed with all entries of the test dataset to give this final performance figure.

Figure 5 shows the percentage of the optimal performance obtained by the different clustering techniques on the AMD GPU for the four different normalization techniques discussed in Section 3.4.
The machine learning methods all perform better than the Top-N method of selecting the kernels based on those that perform best by count, except when the number of kernels selected gets very large. Some of the selection methods perform almost as well when selecting as few as 6 kernels, and don't improve much as the number of kernels increases. This suggests that there are a small number of kernels that perform well for a wide range of input sizes, but that are not the ones that actually perform best for a large number of inputs.

For example when the number of kernels is limited to 4, the 4 top kernels by count are:

• Tiles (4, 8, 4), work-group (16, 16)
• Tiles (4, 8, 4), work-group (8, 16)
• Tiles (4, 8, 4), work-group (8, 32)
• Tiles (8, 4, 4), work-group (8, 32)

The tile sizes are all similar, with slightly different work-group sizes. These configurations perform similarly, and must perform well for some of the most common input sizes. However they do not perform well on the large number of less optimal input sizes, and so overall this selection gives poor performance. In comparison the decision tree selection is:

• Tile (2, 8, 1), work-group (8, 32)
• Tile (2, 8, 4), work-group (16, 16)
• Tile (4, 4, 4), work-group (8, 32)
• Tile (4, 8, 4), work-group (8, 32)

It includes only one of the top performing configurations, but this allows the overall kernel selection to be better suited to the different corner cases. These much more varied configurations therefore give better performance across a wider range of the input sizes.

Figure 5: The performance of each pruning technique in Section 4 as a percentage of the optimal obtainable performance for the AMD R9 Nano GPU, comparing the normalization techniques discussed in Section 3.4.

All clustering methods performed well for the standard scaled normalization, though the Spectral clustering method performed worst after Top-N. For the more sparse normalization techniques the performance of the clustering methods starts to become more varied. Both the decision tree and k-means methods appear to perform well across the different normalization techniques, while the performance of HDBScan can vary.

This is promising for extending this data to the much more sparse data that would be generated by other auto-tuning techniques that run benchmarks of many fewer configurations. In these cases the data will naturally be much more sparse than the brute force dataset, and these normalization techniques mimic the data that might be obtained from these approaches.

The clustering methods most affected by the normalization method are HDBScan and spectral clustering. When the data becomes more sparse these methods appear to select less optimal kernels and therefore gain worse performance overall. In addition HDBScan was the hardest to train, as the number of clusters cannot be specified as a parameter, so a parameter search is required to select the best options to limit the number of kernels.

Figure 6: The performance of each pruning technique in Section 4 as a percentage of the optimal obtainable performance for the Intel i7-6700K CPU, comparing the normalization techniques discussed in Section 3.4.

Figure 6 shows the same data but for the Intel i7-6700K CPU.
In the benchmarks this device was more consistent in the performance it achieved for different input sizes. As such, all kernel selection techniques performed significantly better than for the AMD GPU, where there was much more variation in the obtained performance.

In these benchmarks the HDBScan density-based clustering technique performed surprisingly poorly and the results varied significantly depending on the number of kernels. For the standard normalization technique all tested parameters gave only 4 or 5 kernels.

The decision tree clustering method performed well for the AMD data, often achieving among the best performance, however this was not the case for the CPU. It seemed to lose the least performance on the raw cutoff normalization scheme, but for all other normalization schemes the decision tree tends to be outperformed by the other clustering methods.
The baseline option of choosing the kernels that appear best most often is a weak approach. The more intelligent clustering methods outperformed it in the majority of cases, as they consider the distribution of the data more generally and use that to select the kernels that provide better performance across a wide range of inputs.

The aim of the kernel clustering is to automatically prune the number of kernels to provide in a library. As such, the chosen solution should provide good performance regardless of the device or normalization scheme. The decision tree, spectral clustering and HDBScan methods give varied performance across the devices and types of normalization, whereas the k-means and PCA+k-means clustering methods provide stable and good results. There are certainly cases where these relatively simplistic clustering techniques do not perform as well as some others, but the difference is rarely large.
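As a concrete illustration of the PCA+k-means pruning discussed above, the following sketch clusters kernel configurations by their benchmark performance profiles and keeps one representative per cluster, then scores the chosen subset with a percent-of-optimal metric like that shown in Figures 5 and 6. The data layout (`perf[i, j]` as the normalized performance of configuration i on benchmark problem j), the choice of cluster representative, and all names are assumptions for illustration, not the paper's exact pipeline.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

def prune_kernels(perf, n_kernels=8, n_components=4, seed=0):
    """Pick n_kernels representative configurations from a benchmark matrix.

    perf[i, j] is assumed to be the normalized performance of kernel
    configuration i on benchmark problem j (higher is better).
    """
    # Project each configuration's performance profile into a lower
    # dimensional space before clustering the profiles.
    reduced = PCA(n_components=n_components).fit_transform(perf)
    km = KMeans(n_clusters=n_kernels, n_init=10, random_state=seed).fit(reduced)
    chosen = []
    for c in range(n_kernels):
        members = np.flatnonzero(km.labels_ == c)
        # Keep the member of each cluster with the best mean performance.
        chosen.append(members[np.argmax(perf[members].mean(axis=1))])
    return sorted(chosen)

def percent_of_optimal(perf, chosen):
    # For each problem, compare the best of the chosen kernels to the
    # overall best, then average across problems.
    best_chosen = perf[chosen].max(axis=0)
    best_overall = perf.max(axis=0)
    return 100.0 * (best_chosen / best_overall).mean()
```

Swapping `KMeans` for another scikit-learn clusterer gives the other pruning variants compared in the figures, provided a representative is still picked per cluster.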
5. Deploying the kernels
Selecting which kernels to deploy in a library is only half the story, as our goal is to be able to support any inputs required by our users. This requires a method to map the user's inputs to the best kernel configuration provided by the library. Such a process must be carried out before launching each kernel to ensure that the optimal choice is made at each point. This means that the selection process must be both effective and inexpensive to compute; there is little point gaining a small performance boost in the kernel if it is outweighed by time spent in a large classification system.
The previous sections investigated how to limit the number of kernel configurations that should be provided in a library. Selecting which of these kernels to run is a classification problem that maps the input matrix sizes to the optimal kernel configuration. For each entry in our dataset we can see which of the chosen kernels provides the best performance, and train a classifier to do this selection using standard supervised learning techniques.

There are many different machine learning techniques for classification. The classifier will have to be run each time a new matrix multiplication is launched by the library, so the main challenge is to balance the effectiveness of the classifier against the time taken to make a classification. More complicated state-of-the-art classifiers like neural networks may be very effective, but they are also computationally expensive and so would be a poor choice to integrate in this way. Decision trees, on the other hand, are easy to implement in a performant way and easy to integrate into a library, as they can be implemented as a series of nested if statements within the kernel launcher. If a decision tree can effectively infer the best kernel to use for unseen matrix sizes then it would be an ideal solution.

To establish whether this is the case, we compare the effectiveness of three decision trees to other classification techniques. The decision trees have increasing limits on the depth and numbers of samples allowed for leaf nodes. Varying these parameters helps establish how much the decision tree might be overfitting. Deeper trees can fit the training data better, but will potentially overfit to it and perform poorly on previously unseen inputs.

Table 1: The performance results for the classifiers as a percentage of the absolute optimal performance, for the kernel configurations selected by PCA+k-means for the AMD R9 Nano. Note that the maximum achievable performance for the selection of configurations is limited to 91.19%, 94.62%, 94.94% and 96.89% for the 5, 6, 8 and 15 configurations respectively.

                      Number of configurations
Classifier              5      6      8     15
DecisionTreeA       88.16  86.82  85.53  85.64
DecisionTreeB       86.10  90.62  83.21  83.01
DecisionTreeC       84.56  85.39  82.30  83.66
1NearestNeighbor    77.37  78.93  77.79  75.48
3NearestNeighbor    78.15  78.64  76.85  76.82
7NearestNeighbor    75.38  74.85  75.08  77.39
LinearSVM           68.68  74.46  67.31  77.62
RadialSVM           70.93  70.93  70.93  70.93
RandomForest        86.91  89.31  87.60  83.96
MLP                 63.61  56.35  64.39  62.99

Table 2: The performance results for the classifiers as a percentage of the absolute optimal performance, for the kernel configurations selected by PCA+k-means for the Intel i7-6700K CPU. Note that the maximum achievable performance for the selection of configurations is limited to 96.55%, 96.65%, 97.34% and 97.95% for the 5, 6, 8 and 15 configurations respectively.

                      Number of configurations
Classifier              5      6      8     15
DecisionTreeA       91.65  92.59  93.50  92.29
DecisionTreeB       93.14  91.86  93.87  90.15
DecisionTreeC       92.26  91.11  91.51  91.28
1NearestNeighbor    91.36  91.36  91.40  89.73
3NearestNeighbor    91.18  90.26  91.61  86.42
7NearestNeighbor    88.00  90.15  89.22  87.96
LinearSVM           84.18  76.20  88.32  85.64
RadialSVM           80.49  83.80  78.55  83.80
RandomForest        93.65  93.90  93.26  93.85
MLP                 74.30  79.23  79.23  76.88

The three decision trees are labelled A, B and C. Decision tree A has no limit on the maximum depth and allows splitting down to single-sample leaf nodes if required. Decision tree B has a maximum depth of 6 and requires leaf nodes to have at least 3 samples, while decision tree C has a maximum depth of 3 and requires at least 4 samples at the leaves. There are many other possible combinations of parameters, however additional tuning of these risks overfitting to the testing dataset.

Nearest neighbor is another relatively simple classification technique that classifies an input based on which of the training inputs are closest to it.
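The three decision tree variants described above can be sketched directly with scikit-learn, the library used elsewhere in this study [28]. The feature matrix X (one row of matrix dimensions per benchmark) and label vector y (the index of the best of the chosen kernel configurations) are assumed to come from the benchmark dataset; the variable names here are illustrative.

```python
from sklearn.tree import DecisionTreeClassifier

# Parameter limits taken from the text: A is unrestricted, B and C are
# increasingly constrained in depth and minimum samples per leaf.
TREES = {
    # A: no depth limit, may split down to single-sample leaves.
    "A": DecisionTreeClassifier(),
    # B: depth at most 6, at least 3 samples in each leaf.
    "B": DecisionTreeClassifier(max_depth=6, min_samples_leaf=3),
    # C: depth at most 3, at least 4 samples in each leaf.
    "C": DecisionTreeClassifier(max_depth=3, min_samples_leaf=4),
}

def train_selectors(X, y):
    """Fit each tree variant on (matrix sizes -> best kernel index)."""
    return {name: tree.fit(X, y) for name, tree in TREES.items()}
```

Each fitted model's `predict` can then be evaluated against the held-out benchmark data to produce numbers like those in Tables 1 and 2.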
As such it requires that the training dataset be stored alongside the classifier in order to compute which data points are the input's neighbors. This would be infeasible to deploy within the library, but it provides a useful comparison for what similar classifiers can achieve.

Other classifiers are more complex and require significantly more computation to infer a class from an input. Classifiers like SVMs, which compute the vectors that separate the classes, and random forest ensembles, made up of multiple decision trees combined together, can potentially provide better predictions but would require more work on the host when choosing the kernel to launch.

The comparisons made between these classifiers considered how well they could infer the optimal kernel given the subset of kernels provided by the pruning techniques discussed in Section 4. As the choice of kernels is limited to this subset, the maximum achievable performance is below 100%.

Tables 1 and 2 show the relative performance of the different classification methods for a range of possible kernel configurations. Overall the decision tree classification methods perform well, in many cases significantly better than the more computationally expensive methods.

One of the more surprising observations here is that performance does not improve as the number of classes increases, despite the theoretical maximum achievable performance increasing. The absolute best performance for both devices was obtained with just 6 kernel configurations, with the decision trees obtaining their best performance for either 6 or 8 kernel configurations. While the additional kernel choices may allow higher theoretical performance, the models seem to struggle to differentiate between similar inputs that would require different kernels.
As such, having the extra choice actually hinders the models rather than allowing them to achieve better performance.

When comparing the three decision tree configurations, the performance data does not support the theory that the tree overfits to the training data. The more limited trees (B and C) tend to perform worse than the unlimited decision tree (A), though the numbers are not conclusive. When integrating the decision tree into the SYCL library it is nevertheless helpful to impose some limits, so as to avoid heavily nested if statements and branching code.
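A fitted tree can be lowered into the nested if statements described above so that the selector compiles straight into the kernel launcher. This is a minimal sketch that walks scikit-learn's fitted `tree_` structure and emits C-style branches; the feature and kernel names passed in are hypothetical placeholders for whatever the library's launcher actually uses.

```python
from sklearn.tree import DecisionTreeClassifier

def tree_to_if_chain(clf, feature_names, kernel_names, indent="  "):
    """Emit a fitted decision tree as nested C-style if statements."""
    t = clf.tree_
    lines = []

    def walk(node, depth):
        pad = indent * depth
        if t.children_left[node] == -1:
            # Leaf: return the kernel for the majority class at this node.
            best = t.value[node][0].argmax()
            lines.append(f"{pad}return {kernel_names[best]};")
        else:
            name = feature_names[t.feature[node]]
            threshold = t.threshold[node]
            lines.append(f"{pad}if ({name} <= {threshold:.1f}) {{")
            walk(t.children_left[node], depth + 1)
            lines.append(f"{pad}}} else {{")
            walk(t.children_right[node], depth + 1)
            lines.append(f"{pad}}}")

    walk(0, 0)
    return "\n".join(lines)
```

Capping the tree depth, as for trees B and C, directly bounds the nesting depth of the generated branches.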
6. Testing a full ML model
This work was carried out to help provide general purpose compute libraries to accelerate machine learning applications. Comparing the inference time of a machine learning model using these techniques to similar libraries that use more manual tuning techniques can show the efficacy of this work.

One of the popular image classification models of recent years was VGG16 [24], developed at the Oxford Visual Geometry Group in 2015. By modern standards it is a simple neural network made up of 16 convolutional and pooling layers. Despite the small number of layers it has more parameters than most modern networks, at 138 million, as the convolutional layers have many features. While no longer state of the art, this model is still regularly used by machine learning practitioners and is much simpler than more recent image classification networks, making it a good candidate for evaluating the performance of the kernel selection process. Comparing the performance of individual kernels provides a good proxy for how well a system will perform, but an evaluation of the full system helps uncover any assumptions and shortcomings that would not be visible at the micro-benchmark scale.

A SYCL-DNN sample implements the VGG16 network in SYCL using the pretrained weights provided by the Keras Applications [33] Python module. It can perform image classification based on the ImageNet dataset, providing the class of an input image from the 1000 different ImageNet classes. This pretrained network achieves 71.3% accuracy classifying the top class of an image in the ImageNet dataset. It is not the best performing model available through Keras but is one of the simplest to implement.

In addition to testing the performance of this network on the devices discussed earlier in this paper, we also tested two additional OpenCL devices.
The kernels used were tuned for each device using the methods discussed above and the resulting deployment and selection algorithms were integrated into SYCL-DNN.

The devices used to test these techniques were:

• AMD R9 Nano GPU
• Intel i7-6700K CPU
• Intel HD 530 Gen9 GPU
• ARM Mali G71 GPU
SYCL-DNN allows users to specify different backends that provide the matrix multiplication routines used in neural networks. The library provides its own matrix multiplication, but if a platform has access to a BLAS or other matrix library then it can be easily integrated to make use of those optimized routines. This functionality was used to provide comparisons to the tuned SYCL-DNN matrix multiplication kernels, using both a SYCL-BLAS [5] backend and a CLBlast [4] backend.

SYCL-BLAS is another library developed by Codeplay to provide basic linear algebra kernels. Designed with expression trees and templated kernels, it allows users to easily fuse kernels together at compile time, reducing the need to load and store data between kernel launches, and is optimized for a range of devices. SYCL-BLAS provides a number of different matrix multiplication routines, including ones utilizing local (or shared) memory and ones designed for tall skinny matrices that compute partial results which are combined in a final reduction. These kernels are significantly more sophisticated than the simple kernel studied in this paper, however the parameters are all tuned by hand, requiring significant developer effort and time.

CLBlast is an OpenCL-based BLAS library designed to be performant on a wide range of OpenCL devices. It includes an automated tuning system to select the optimal kernels for different devices, though this system is limited to selecting the single best kernel for each device. Before running this benchmark, the CLBlast library was tuned for each of the benchmark devices used. Similarly to SYCL-BLAS, the CLBlast library contains multiple implementations of matrix multiplication kernels to help achieve performance for different matrix shapes.
The model was executed a number of times to accurately measure the time to completion. A single image was used as an input, and the model classifies the contents of that image. The weights and initial image are all transferred to the compute device before timing starts, so the benchmark time only includes the computation and not data transfer. The SYCL-DNN matrix multiplication routine was tuned to use 8 kernel configurations per device, selected using PCA+k-means and a decision tree based runtime selection process. As discussed in Sections 4 and 5, these approaches give good performance for different matrix sizes and devices.

Figure 7 shows the execution time to compute one inference using the VGG16 model. The devices perform significantly differently, as would be expected given their vastly different compute resources.

The AMD R9 Nano performed an inference in less than 20 ms using the optimized and tuned matrix multiplication kernels from SYCL-BLAS and CLBlast. This GPU, along with this particular machine learning model, was one of the main targets of optimization during the development of SYCL-BLAS, so it is expected to perform well, outperforming both the kernel studied in this paper and CLBlast. The SYCL-DNN kernel achieved times that were not far off the others, despite the kernel being much simpler than those in the heavily optimized libraries and not making use of the GPU's fast local memory.

By default CLBlast will use generic tuning parameters based on similar devices, so for the R9 Nano the parameters are based on similar AMD cards. Tuning CLBlast for this specific GPU using the provided tuning tools didn't provide any benefit, though the actual kernels used did change. For the other devices the tuning often had a negative impact on the performance of CLBlast. This is likely a result of the limited way that the tuning works, causing it to optimize for best results on matrix sizes that differ from those used in the VGG16 model.
The GEMM routine in particular is tuned for single matrices of size 1024x1024 and 256x256, whereas the inputs to GEMM used in the model have a batch size of 16 and vary from 12544x64 to 512x512.

Figure 7: The inference time in milliseconds of a single image using the VGG16 model implemented using SYCL-DNN and different matrix backends when run on different devices. (Four panels: AMD R9 Nano, Intel Gen9, Intel i7-6700K and ARM Mali-G71; each compares SYCL-DNN, SYCL-BLAS, tuned CLBlast and untuned CLBlast.)

For the Intel CPU and integrated GPU the SYCL-DNN kernel actually performed better than the optimized libraries. The CPU has very different performance characteristics and compute resources to any of the GPUs, and CLBlast particularly struggles to adapt to this.

Both SYCL-BLAS and CLBlast achieve similar performance on the ARM Mali GPU, taking over 700 ms per inference. SYCL-DNN on the other hand achieves under 400 ms per inference, as it makes use of 4 different configurations out of the chosen 8. This variety of possible kernel configurations allows the library to handle the different matrix sizes, where the other libraries only use a single kernel configuration.

One of the areas where the SYCL-DNN kernels are at a disadvantage to the other libraries is in the final fully connected layers of the model. These fully connected layers are implemented as a matrix multiplication, but when using a single image the activation tensor is actually a one-dimensional vector rather than a matrix. As such it is much more efficient to use a dedicated matrix-vector multiplication routine, as is common in BLAS libraries. The SYCL-DNN kernel is comparatively inefficient in this case, as it is designed to compute 2D tiles of an output which here is only one-dimensional. Despite this, the library manages to provide sufficient performance on these operations that the automatically tuned SYCL-DNN kernels outperform the other libraries overall.
7. Conclusions
Auto-tuning allows libraries to achieve performance on a wide range of devices without requiring vast amounts of developer effort to adapt kernels and routines to new hardware. In this paper we used a matrix multiplication case study to evaluate methods that allow auto-tuning to be deployed in compiled SYCL libraries, balancing binary size, performance and adaptability to unseen inputs.

Unsupervised machine learning techniques like clustering provide effective methods to reduce the large kernel parameter space for a wide range of different input sizes without sacrificing much performance. Some of these methods proved more reliable and resilient than others, with some of the more advanced methods, like density-based clustering, struggling to provide performant kernels in some cases.

One of the concerns raised in the original paper [22] introducing these ideas was that the techniques may rely too heavily on dense benchmark timing information. Intelligent auto-tuning techniques only sample from the very large kernel parameter space, while the data collected for this study used a comparatively small parameter space and so used a brute-force benchmarking technique. The normalization techniques discussed in Section 3.4 introduce sparsity into the data, and Section 4 shows that while this does have an impact on the performance of the kernel selection routines, the difference is minimal. This is promising for extending these results to more complicated kernels with more parameters that can take a larger range of values.

After selecting the kernels to deploy in the SYCL library, there needs to be a runtime routine to choose which of these kernels to execute for any given input.
The techniques discussed in Section 5 show that decision trees can provide good performance, as well as being easy to implement and integrate into a library.

When integrated into SYCL-DNN, these techniques met or vastly exceeded the performance of other optimized BLAS libraries for a representative machine learning model. The performance was competitive on a range of devices, from powerful desktop GPUs through to embedded mobile GPUs, even though the kernels themselves are relatively simple and don't use as many hardware features as those in the other libraries.

Overall these tuning and deployment techniques provide an efficient subset of all possible kernels in settings where the kernels have to be provided in binary format, as with SYCL. These completely automated approaches allow new devices to be supported with very little developer effort and relatively small code changes.

Acknowledgements
The author would like to thank Duncan McBain and Daniel Soutar for thoughtful comments and interesting discussions about this work. This research did not receive any specific grant from funding agencies in the public, commercial, or not-for-profit sectors.
References

[1] J. E. Stone, D. Gohara, G. Shi, OpenCL: A parallel programming standard for heterogeneous computing systems, Computing in Science & Engineering 12 (3) (2010) 66–73. doi:10.1109/MCSE.2010.69.
[2] SYCL: C++ single-source heterogeneous programming for OpenCL, accessed: 2019-03-11.
[3] clBLAS: A software library containing BLAS functions written in OpenCL, https://github.com/clMathLibraries/clBLAS, accessed: 2020-08-26.
[4] C. Nugteren, CLBlast: A tuned OpenCL BLAS library, in: Proceedings of the International Workshop on OpenCL, IWOCL '18, ACM, New York, NY, USA, 2018, pp. 5:1–5:10. doi:10.1145/3204919.3204924.
[5] SYCL-BLAS: An implementation of BLAS using the SYCL open standard, https://github.com/CodeplaySoftware/SYCL-BLAS, accessed: 2019-04-09.
[6] J. I. Aliaga, R. Reyes, M. Goli, SYCL-BLAS: Leveraging expression trees for linear algebra, in: Proceedings of the 5th International Workshop on OpenCL, IWOCL 2017, ACM, New York, NY, USA, 2017, pp. 32:1–32:5. doi:10.1145/3078155.3078189.
[7] C. Nugteren, V. Codreanu, CLTune: A generic auto-tuner for OpenCL kernels, in: 2015 IEEE 9th International Symposium on Embedded Multicore/Many-core Systems-on-Chip, 2015, pp. 195–202. doi:10.1109/MCSoC.2015.10.
[8] B. van Werkhoven, Kernel Tuner: A search-optimizing GPU code auto-tuner, Future Generation Computer Systems 90 (2019) 347–358. doi:10.1016/j.future.2018.08.004.
[9] T. L. Falch, A. C. Elster, Machine learning based auto-tuning for enhanced OpenCL performance portability, in: 2015 IEEE International Parallel and Distributed Processing Symposium Workshop, 2015, pp. 1231–1240. doi:10.1109/IPDPSW.2015.85.
[10] J. Bergstra, N. Pinto, D. Cox, Machine learning for predictive auto-tuning with boosted regression trees, in: 2012 Innovative Parallel Computing (InPar), 2012, pp. 1–9. doi:10.1109/InPar.2012.6339587.
[11] D. Grewe, Z. Wang, M. P. O'Boyle, Portable mapping of data parallel programs to OpenCL for heterogeneous systems, in: 2013 IEEE/ACM International Symposium on Code Generation and Optimization (CGO), IEEE Computer Society, Los Alamitos, CA, USA, 2013, pp. 1–10. doi:10.1109/CGO.2013.6494993.
[12] W. F. Ogilvie, P. Petoumenos, Z. Wang, H. Leather, Active learning accelerated automatic heuristic construction for parallel program mapping, in: 2014 23rd International Conference on Parallel Architecture and Compilation Techniques (PACT), 2014, pp. 481–482. doi:10.1145/2628071.2628128.
[13] T. D. Han, T. S. Abdelrahman, Automatic tuning of local memory use on GPGPUs, in: ADAPT Workshop proceedings, 2015, Vol. 1410.0759, 2014. arXiv:1410.0759.
[14] B. van Werkhoven, J. Maassen, H. E. Bal, F. J. Seinstra, Optimizing convolution operations on GPUs using adaptive tiling, Future Generation Computer Systems 30 (2014) 14–26, Special Issue on Extreme Scale Parallel Architectures and Systems, Cryptography in Cloud Computing and Recent Advances in Parallel and Distributed Systems, ICPADS 2012 Selected Papers. doi:10.1016/j.future.2013.09.003.
[15] Y. Li, J. Dongarra, S. Tomov, A note on auto-tuning GEMM for GPUs, in: G. Allen, J. Nabrzyski, E. Seidel, G. D. van Albada, J. Dongarra, P. M. A. Sloot (Eds.), Computational Science – ICCS 2009, Springer Berlin Heidelberg, Berlin, Heidelberg, 2009, pp. 884–892.
[16] A. Nukada, S. Matsuoka, Auto-tuning 3-D FFT library for CUDA GPUs, in: Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis, SC '09, ACM, New York, NY, USA, 2009, pp. 30:1–30:10. doi:10.1145/1654059.1654090.
[17] A. Mametjanov, D. Lowell, C. Ma, B. Norris, Autotuning stencil-based computations on GPUs, in: 2012 IEEE International Conference on Cluster Computing, 2012, pp. 266–274. doi:10.1109/CLUSTER.2012.46.
[18] Y. Zhang, F. Mueller, Auto-generation and auto-tuning of 3D stencil codes on GPU clusters, in: Proceedings of the Tenth International Symposium on Code Generation and Optimization, CGO '12, ACM, New York, NY, USA, 2012, pp. 155–164. doi:10.1145/2259016.2259037.
[19] M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado, A. Davis, J. Dean, M. Devin, S. Ghemawat, I. Goodfellow, A. Harp, G. Irving, M. Isard, Y. Jia, R. Jozefowicz, L. Kaiser, M. Kudlur, J. Levenberg, D. Mané, R. Monga, S. Moore, D. Murray, C. Olah, M. Schuster, J. Shlens, B. Steiner, I. Sutskever, K. Talwar, P. Tucker, V. Vanhoucke, V. Vasudevan, F. Viégas, O. Vinyals, P. Warden, M. Wattenberg, M. Wicke, Y. Yu, X. Zheng, TensorFlow: Large-scale machine learning on heterogeneous systems, software available from tensorflow.org (2015).
[20] T. Chen, M. Li, Y. Li, M. Lin, N. Wang, M. Wang, T. Xiao, B. Xu, C. Zhang, Z. Zhang, MXNet: A flexible and efficient machine learning library for heterogeneous distributed systems, arXiv:1512.01274.
[21] S. Chetlur, C. Woolley, P. Vandermersch, J. Cohen, J. Tran, B. Catanzaro, E. Shelhamer, cuDNN: Efficient primitives for deep learning, CoRR abs/1410.0759. arXiv:1410.0759.
[22] J. Lawson, Towards automated kernel selection in machine learning systems: A SYCL case study, in: 2020 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW), 2020, pp. 475–478. doi:10.1109/IPDPSW50202.2020.00086.
[23] R. Burns, J. Lawson, D. McBain, D. Soutar, Accelerated neural networks on OpenCL devices using SYCL-DNN, in: Proceedings of the International Workshop on OpenCL, IWOCL '19, ACM, New York, NY, USA, 2019, pp. 10:1–10:4. doi:10.1145/3318170.3318183.
[24] K. Simonyan, A. Zisserman, Very deep convolutional networks for large-scale image recognition, CoRR abs/1409.1556. arXiv:1409.1556.
[25] K. He, X. Zhang, S. Ren, J. Sun, Deep residual learning for image recognition, in: 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 770–778. doi:10.1109/CVPR.2016.90.
[26] M. Sandler, A. Howard, M. Zhu, A. Zhmoginov, L. Chen, MobileNetV2: Inverted residuals and linear bottlenecks, in: 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2018, pp. 4510–4520. doi:10.1109/CVPR.2018.00474.
[27] Towards automated kernel selection in machine learning systems: Supplementary code and dataset, https://github.com/jwlawson/tuning_kernels, accessed: 2020-02-07.
[28] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, E. Duchesnay, Scikit-learn: Machine learning in Python, Journal of Machine Learning Research 12 (2011) 2825–2830.
[29] K. Pearson, LIII. On lines and planes of closest fit to systems of points in space, The London, Edinburgh, and Dublin Philosophical Magazine and Journal of Science 2 (11) (1901) 559–572. doi:10.1080/14786440109462720.
[30] M. E. Tipping, C. M. Bishop, Probabilistic principal component analysis, Journal of the Royal Statistical Society, Series B (Statistical Methodology) 61 (3) (1999) 611–622.
[31] R. J. G. B. Campello, D. Moulavi, J. Sander, Density-based clustering based on hierarchical density estimates, in: J. Pei, V. S. Tseng, L. Cao, H. Motoda, G. Xu (Eds.), Advances in Knowledge Discovery and Data Mining, Springer Berlin Heidelberg, Berlin, Heidelberg, 2013, pp. 160–172.
[32] L. McInnes, J. Healy, Accelerated hierarchical density based clustering, in: 2017 IEEE International Conference on Data Mining Workshops (ICDMW), 2017, pp. 33–42. doi:10.1109/ICDMW.2017.12.
[33] Keras Applications: Reference implementations of popular deep learning models, https://github.com/keras-team/keras-applications, accessed: 2020-08-27.