Towards High Performance, Portability, and Productivity: Lightweight Augmented Neural Networks for Performance Prediction
Ajitesh Srivastava, Naifeng Zhang, Rajgopal Kannan, Viktor K. Prasanna
∗ University of Southern California, {ajiteshs, naifengz, prasanna}@usc.edu
† US Army Research Lab, [email protected]
§ Equal contribution
Abstract: Writing high-performance code requires significant expertise in the programming language, compiler optimizations, and hardware knowledge. This often leads to poor productivity and portability, and it is inconvenient for a non-programmer domain-specialist such as a physicist. More desirable is a high-level language in which the domain-specialist simply specifies the workload in terms of high-level operations (e.g., matrix-multiply(A, B)) and the compiler identifies the best implementation, fully utilizing the heterogeneous platform. To create a compiler that supports productivity, portability, and performance simultaneously, it is crucial to predict the performance of the various available implementations (variants) of the dominant operations (kernels) contained in the workload on various hardware, in order to decide (a) which variant should be chosen for each kernel in the workload, and (b) on which hardware resource the variant should run. To enable this performance prediction, we propose lightweight augmented neural networks for arbitrary combinations of kernel, variant, and hardware. A key innovation is utilizing the mathematical complexity of the kernels as a feature to achieve higher accuracy. These models are compact, which reduces training time and enables fast inference at compile-time and run-time. Using models with fewer than 75 parameters and only 250 training data instances, we obtain a low MAPE of 3%, significantly outperforming traditional feed-forward neural networks on 48 kernel-variant-hardware combinations. We further demonstrate that our variant-selection approach can be used in Halide implementations to obtain up to 1.7x speedup over Halide's auto-scheduler.
Index Terms: Lightweight augmented neural networks, Performance prediction, Productivity, Portability, Compiler, Heterogeneous platforms
I. INTRODUCTION
With various heterogeneous technologies emerging today, there have been unprecedented opportunities for accelerating applications. Application-specific integrated circuits (ASICs) [1] provide highly specialized implementations, but they require implementation expertise and are specialized for one application. On the other hand, CPUs, GPUs, and FPGAs provide more flexibility and are easier to program, but are much slower compared to ASICs. Providing flexibility in applications and ease of implementation while approaching the speedup offered by ASICs has been the focus of many recent works [2], [3]. Even writing CPU/GPU code that gets the most out of the available hardware requires programming expertise, hardware knowledge, and time. Further, that optimized code may not be "portable", i.e., it may not work well on a different platform. Finally, a domain-specialist such as a physicist is expected to know the operations involved in their workload, but not the details of their highly-optimized implementations. This is important for "productivity", i.e., implementing the desired workflow with few lines of code, without worrying about code optimizations.

With the objective of achieving high performance, portability, and productivity, we are building a compiler that executes a high-level domain-specific language on heterogeneous platforms, aligned with recent DARPA projects [4]. The user writes high-level code that can be broken down into high-level operations (matrix multiplication, convolution, etc.), which we call kernels. The user only specifies the operation with its inputs, such as matrix-multiply(A, B), without worrying about the optimized implementation of the actual multiplication, thus enabling high productivity. It is the compiler's job to automatically identify how to best execute this code by distributing the kernels among the available hardware configurations on the platform.

In order to identify a high-performance execution plan, the compiler should be able to predict the performance of a kernel on various hardware resources. This enables the following decisions: (i) Variant-selection:
A compiler may have several variants implementing the kernel on the same hardware in its library, with potentially different performances, e.g., the Boost library vs. the Eigen library for matrix multiplication. The variation may also come from setting certain parameters in the implementation that affect the runtime, such as compilation flags and other tunable parameters of the implementation. Given the input, which variant should be selected? (ii) Mapping to hardware: The workload is a collection of possibly interdependent kernels. Each kernel can be mapped to various available hardware resources (CPUs, GPUs, etc.). For each kernel-hardware pair, there may be a different kernel variant that is optimal. Having accurate kernel performance models is crucial for these decisions. We acknowledge that our approach to designing this compiler is not suited for compiling arbitrary low-level code, as we rely on already available implementations of certain kernels. However, the kernels chosen in this paper dominate the runtime of many workflows, including machine learning. In fact, our chosen kernels cover >
80% of the workflows [5] in the DARPA SDH program [4], [6]. We emphasize that predicting the execution time is more useful than simply knowing the better variant or hardware resource for individual kernels. For instance, suppose we want to execute two matrix multiplications, one small and one large, that have no data dependencies, on a platform containing a CPU and a GPU. While the smaller multiplication alone may be faster on the GPU, it should still be scheduled on the CPU so that the GPU is available for the larger multiplication.

To enable portability, the compiler must support learning performance models of execution times T(K_i, H_j) on arbitrary platforms, where K_i is an arbitrary kernel implemented on an arbitrary hardware H_j. We do not assume any access to hardware profilers or details of the kernel implementation. The kernel implementations on various hardware are treated as black-boxes, and we can only manipulate the inputs to the implementations. This makes our approach easily extensible when a new implementation of a kernel is added to the library. These performance models can be trained during compiler installation by generating benchmark datasets for each kernel (along with its variants) on the available hardware. To make this feasible, the models must be lightweight so that they can learn quickly from small training data without overfitting. Once the models are trained, the compiler is ready to schedule kernels at compile-time. The prediction may also be needed at runtime: the exact input to the kernel may not be known at compile-time, and therefore the mapping decisions (which variant to select and where to run) have to be made dynamically at runtime. Making the models compact is necessary to ensure that they do not constitute a significant portion of the runtime. Here, we build performance models for four ubiquitous kernels [4] found in common workflows: (i) Matrix-Matrix Multiplication, (ii) Matrix-Vector Multiplication, (iii) Matrix Convolution, and (iv) Max-Pooling. We propose a novel approach called Augmented Neural Network (NN+C), which is extremely lightweight and utilizes the time complexity function to perform execution time prediction.
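To make the scheduling decision above concrete, the following sketch (ours, with placeholder predicted times; none of the names or values come from the paper) enumerates the possible assignments of two independent matrix multiplications to a CPU and a GPU and picks the assignment minimizing the makespan:

    from itertools import product

    # Placeholder predicted execution times T[kernel][resource] (illustrative only).
    T = {"mm_small": {"cpu": 0.8, "gpu": 0.5},
         "mm_large": {"cpu": 9.0, "gpu": 2.0}}

    def makespan(assignment):
        # Kernels mapped to the same resource run sequentially; the plan
        # finishes when the most loaded resource finishes.
        load = {"cpu": 0.0, "gpu": 0.0}
        for kernel, resource in assignment.items():
            load[resource] += T[kernel][resource]
        return max(load.values())

    best = min((dict(zip(T, choice))
                for choice in product(["cpu", "gpu"], repeat=len(T))),
               key=makespan)
    print(best)  # {'mm_small': 'cpu', 'mm_large': 'gpu'}: the small multiply yields the GPU

Even though the small multiplication alone runs faster on the GPU, the makespan-minimizing plan places it on the CPU, exactly as argued above.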
Key Contributions:
Our key contributions are as follows.
• We propose novel lightweight neural network models for kernel performance prediction on CPUs and GPUs.
• We demonstrate that the lightweight models are portable to more than 48 kernel-variant-hardware combinations. Results from 48 combinations are discussed, covering 4 kernels, each with 2 variants on each of 3 CPUs and 2 variants on each of 2 GPUs.
• We demonstrate that our models achieve a low MAPE with a small training set and a short training time, outperforming traditional feed-forward networks for all 48 kernel-variant-hardware combinations.
• We demonstrate that our performance models can be used to identify the best implementation of a kernel where thousands of variants can exist with significantly different runtimes. Specifically, for the Halide [7] implementation of the Blur filter, our approach results in up to 1.7x speedup over the Halide auto-scheduler.

II. RELATED WORK
Most existing works focus on predicting the performance of a whole, specific workload. Huang et al. [8] use sparse polynomial regression to predict the execution time of arbitrary programs. In [9], a neural network is used to predict the execution time of a workload. On the other hand, [10] proposes feature selection from workloads to identify similar applications for which the runtimes are known, and predicts the runtime for the given application using the mean or linear regression. These approaches are limited to one application or a set of similar applications and require retraining for every new application, and thus do not scale. Further, it is not clear what types of workloads result in good predictions, nor whether a similar approach can be ported to other hardware. Instead, we perform predictions for the coarse-level building blocks of a program on various hardware. If a compiler can predict performance for coarse-level operations (kernels such as matrix multiply) on the available hardware, it can make mapping decisions accordingly. For this, we consider four kernels that are dominant in many workloads. Therefore, instead of being tied to a particular workflow, our approach applies to many, such as the entire class of deep learning workloads.

Other existing works [11], [12] rely on the instruction set architecture or hardware-specific metrics, which can potentially be used to predict kernel (instead of workload) performance. However, this would require explicit knowledge of the hardware and corresponding profilers, and thus would reduce portability. Our approach enables a black-box treatment of the kernels and allows prediction without knowing the specific architecture or implementation details. Table I summarizes the works closest to ours. Although we do not compare our approach against the above-mentioned works quantitatively, as they target different objectives, we do compare against their chosen machine learning models (neural networks and linear regression) and show that our lightweight augmented neural networks achieve superior accuracy. Finally, our work is different from [13], which focuses on performance prediction of hardware using hardware profiling instead of the performance of dominant operations.

Table I: Distinction from related works
Approach                           | Workload Coverage | Portability
Workload-specific [8]-[10]         | Low               | N/A
ISA/Hardware-specific [11], [12]   | High              | Low/Medium
Our work                           | Medium            | High
III. PROPOSED APPROACH
Problem Definition:
For each operation on an arbitrary platform with arbitrary implementations, given the corresponding inputs, find a lightweight model that accurately predicts the execution time using a small amount of training time.

To solve this problem, we propose the Augmented Neural Network (NN+C). The key idea of NN+C is utilizing a known mathematical function f(K, H) as an extra input to the NN. For example, in Matrix-Matrix Multiplication, besides using basic features such as matrix dimensions and matrix density as inputs, we calculate the total number of operations during Matrix-Matrix Multiplication; therefore, f(K, H) = m × n × k. f(K, H) for Matrix-Vector Multiplication, Matrix Convolution, and Max-Pooling is calculated similarly. The lightweight aspect enables fast decision making during compile-time as well as run-time. These augmented neural networks provide the flexibility to incorporate any tunable parameter available for the kernel and the hardware.
A. Neural Network Structure

The structure of NN+C is shown in Figure 1. Inputs to the neural network are:
1) the known mathematical function f(K, H);
2) kernel parameters K_i, such as input matrix dimensions and matrix density;
3) hardware/code-optimization parameters H_j, for example, how many threads are used in the multi-threaded implementation, and other controllable features that may affect the runtime, such as compilation flags.

Our augmented neural network contains less than two hidden layers. The output layer has one node, which is the predicted execution time. The number of nodes in the hidden layers varies for different kernels and different inputs, resulting in a different model for each kernel. Further, the models for a given kernel differ between CPU and GPU due to different inputs: on CPU we use multi-threading and take the number of threads as an input. Thus, for the four kernels mentioned above and two hardware configurations, this results in eight different neural network structures. However, the structure of a model remains the same irrespective of the implementation of the kernel (e.g., different software libraries) and the type of CPU or GPU (e.g., Intel or AMD); only the weights of the neural network, which are learned during training, change.

Figure 1: Structure of the Augmented Neural Network (NN+C): the inputs f(K, H), kernel parameters K_1, ..., K_m, and hardware parameters H_1, ..., H_n feed the hidden layer(s), which produce a single output node O.

B. Model Inputs

a) Matrix-Matrix Multiplication (A_{m,n} × B_{n,k}): Inputs are the dimensions of the matrices m, n, and k, the densities of matrix A (d_1 = number of non-zero entries / (m × n)) and of matrix B (d_2), and the number of threads we utilize during multi-threading on CPU, N_thd, which is an extra input for operations on CPU and is not present for GPU. We augment the neural network with c = f(K, H), i.e., roughly the total number of operations in the kernel. In this case, c = m × n × k.

b) Matrix-Vector Multiplication (A_{m,n} × B_{n,1}): Inputs are m, n, d, c, and N_thd as defined above. m and n are the dimensions of matrix A. d is the density of matrix A, and the density of vector B is set to 1. c is the number of operations, c = m × n. N_thd is the number of threads.

c) Matrix Convolution (A_{m,n} ∗ B_{r,r}): Inputs are m, n, d, c, and N_thd as defined above, and r is the dimension of the square matrix B. d is the density of matrix A, and the density of the square matrix B is set to 1. The number of operations is given by c = (m − r + 1) × (n − r + 1) × r². N_thd is the number of threads.

d) Max-Pooling (A_{m,n} ∗ B_{s,s}): Inputs are m, n, d, c, and N_thd as defined above, and s is the dimension of the square matrix B. d is the density of matrix A, and the density of the square matrix B is set to 1. The number of operations is given by c = ⌈n/s⌉ × ⌈m/s⌉ × s². N_thd is the number of threads.
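The augmented feature c = f(K, H) above can be computed directly from the kernel parameters. A minimal sketch in Python (helper names are ours, not the paper's):

    import math

    def ops_matmul(m, n, k):
        # Matrix-Matrix Multiplication A(m,n) x B(n,k): c = m*n*k
        return m * n * k

    def ops_matvec(m, n):
        # Matrix-Vector Multiplication A(m,n) x b(n,1): c = m*n
        return m * n

    def ops_convolution(m, n, r):
        # Matrix Convolution A(m,n) * B(r,r): c = (m-r+1)*(n-r+1)*r^2
        return (m - r + 1) * (n - r + 1) * r * r

    def ops_maxpool(m, n, s):
        # Max-Pooling with s x s windows: c = ceil(n/s)*ceil(m/s)*s^2
        return math.ceil(n / s) * math.ceil(m / s) * s * s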
IV. EXPERIMENTS

A. Platforms and Optimizations
To demonstrate the portability of our models, we conducted our experiments on five platforms: Intel(R) Xeon(R) CPU E5-2650 v2 @ 2.60GHz (Xeon), Intel(R) Core i7-8750H CPU @ 2.20GHz (I7), Intel(R) Core i5-7360U CPU @ 2.30GHz (I5), NVIDIA Tesla K40c (Tesla), and NVIDIA Quadro K420 (Quadro).

To perform the kernel operations on CPU, we used the Eigen library and the Boost library in C++. Eigen/Dense, Eigen/Sparse, uBLAS/matrix, and uBLAS/matrix_sparse are used for the dense and sparse matrices in each kernel. Multi-threading was also used in Eigen to vary the number of threads. However, it is difficult to vary the number of threads in the Boost library without heavily changing the code structure; owing to our black-box approach, we used a single thread in the Boost library. Among our platforms, Xeon has 16 cores, 32 threads; I7 has 12 cores, 24 threads; and I5 has 2 cores, 4 threads. For all operations on GPU, we used two CUDA implementations, one through global memory and one through shared memory. This results in 10 implementations of each kernel: 2 variants on each of 3 CPUs and 2 variants on each of 2 GPUs. We have published our code for reproducibility at https://github.com/Naifeng/Augmented-Neural-Network.
B. Datasets

We measured the performance of four kernels on each platform: Matrix-Matrix Multiplication (MM), Matrix-Vector Multiplication (MV), Matrix Convolution (MC), and Max-Pooling (MP). Other kernels such as LU decomposition and the Blur filter were also evaluated, but their results have been omitted for brevity. For each kernel-variant-hardware combination (there are 40 such combinations), we generated 500 instances of data, where 250 instances were used to train the model and 250 to test. Each data instance was generated randomly with the ranges of parameters described in Table II (a data-generation sketch follows the table). While the experiments may be conducted with a different set of ranges, we chose these ranges as they are common sizes for deep learning workflows. Since we use multi-threading on CPU, all operations on CPU take an extra input N_thd, which is randomly generated between 1 and the maximum number of threads supported by the given platform.

Table II: Parameters for data generation

Matrix-Matrix Multiplication: m, n, k ∈ {…}; d_1 ∈ {…, log2(m × n)}; d_2 ∈ {…, log2(n × k)}
Matrix-Vector Multiplication: m, n ∈ {…}; d ∈ {…, log2(m × n)}
Matrix Convolution: r ∈ {…}; m, n ∈ {r, r + 1, r + 2, …}; d ∈ {…, log2(m × n)}
Max-Pooling: r ∈ {…}; s ∈ {…}; m, n ∈ {r, r + 1, r + 2, …}; d ∈ {…, log2(m × n)}
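As an illustration of the benchmark-data generation, the sketch below samples one MM training instance; the sampling ranges are placeholders of our own choosing, not the exact bounds of Table II:

    import random

    def sample_mm_instance(max_threads):
        # Placeholder ranges; Table II's exact bounds are not reproduced here.
        m, n, k = (random.choice([2 ** i for i in range(4, 11)]) for _ in range(3))
        d1 = random.random()                   # density of A
        d2 = random.random()                   # density of B
        n_thd = random.randint(1, max_threads) # CPU-only extra input
        return {"m": m, "n": n, "k": k, "d1": d1, "d2": d2,
                "N_thd": n_thd, "c": m * n * k}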
C. Models

Our augmented neural networks are built with the TensorFlow framework. Each model is kept under 75 parameters to remain lightweight and keep training time short. All models have 2 dense layers and use ReLU as the activation function. We use Adam as the optimizer [14], with the learning rate varying between 0.01, 0.001, and 0.0001. The loss function is mean squared error. Each epoch trains on the full batch. The number of parameters of each model as well as its average training time is shown in Table III; a minimal model sketch follows the table.

Table III: Number of parameters and average training times
       MM       MV       MC      MP
CPU    64, 19s  50, 18s  73, 6s  73, 6s
GPU    41, 19s  73, 6s   50, 8s  73, 7s
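A minimal sketch of such a model in TensorFlow/Keras, assuming the MM-on-CPU feature set [m, n, k, d1, d2, N_thd, c]; the hidden width of 6 is our choice to stay under 75 parameters (7·6 + 6 + 6 + 1 = 55) and is not taken from the paper:

    import tensorflow as tf

    def build_nn_plus_c(num_inputs=7, hidden_units=6, learning_rate=0.01):
        model = tf.keras.Sequential([
            tf.keras.layers.Input(shape=(num_inputs,)),
            tf.keras.layers.Dense(hidden_units, activation="relu"),
            tf.keras.layers.Dense(1),          # predicted execution time
        ])
        model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate),
                      loss="mse")
        return model

    # Full-batch training, as described above:
    # model.fit(X_train, y_train, batch_size=len(X_train), epochs=num_epochs)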
D. Baselines

We compare our method against four baselines: (1) Neural Network (NN): NN is the same as the implementation of NN+C except that it does not take the number of operations as an extra input. (2) Constant (C): in C, we take only the number of operations as input and predict the execution time using linear regression. (3) Augmented Linear Regression (LR+C): we take the same inputs as NN+C but use linear regression. (4) Augmented Non-Linear Regression (NLR+C): we take the same inputs as NN+C but use random forest regression [15]. Random-forest-based regression has been demonstrated to be competitive in performance prediction [16], [17].
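A sketch of the two regression baselines, e.g., with scikit-learn, on the same augmented feature matrix X (kernel parameters plus c) and measured runtimes y; the synthetic data below is only a stand-in to make the snippet runnable:

    import numpy as np
    from sklearn.linear_model import LinearRegression
    from sklearn.ensemble import RandomForestRegressor

    rng = np.random.default_rng(0)
    X = rng.uniform(1, 1024, size=(250, 7))          # stand-in augmented features
    y = X[:, -1] * 1e-9 + rng.normal(0, 1e-4, 250)   # stand-in runtimes, roughly proportional to c

    lr_c = LinearRegression().fit(X, y)                                        # LR+C
    nlr_c = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)  # NLR+C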
E. Evaluation Metrics
We used mean absolute percentage error (MAPE) to evaluate the predictions {t̂_1, t̂_2, ..., t̂_N} obtained by the baselines and our models w.r.t. the ground truth {t_1, t_2, ..., t_N}:

MAPE = (100 / N) × Σ_{i=1}^{N} |t_i − t̂_i| / t_i.   (1)

By the definition of MAPE, a small misprediction (|t_i − t̂_i|) might lead to an exceptionally high MAPE (up to 5000%) if the true runtime t_i is minute. These extreme MAPE values skew the average even though most of the predictions are accurate. Thus, we introduced a threshold at the 30% of the testing data with the lowest runtimes. Overall, the average runtime of the testing data below the threshold is 13% of the average runtime of all the testing data, but these low-runtime instances contribute approximately 80% of the overall MAPE. For example, when analyzing NN+C performance on MC on GPU, the MAPE of the lowest 30% is 128% and the MAPE of the highest 70% is 15%, whereas the overall MAPE is 49%. Therefore, in reporting MAPE, we drop the 30% of testing data with the lowest runtimes for a more precise assessment of the models' performance.
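The reported metric can be computed as in the sketch below (our reading of the procedure: drop the 30% of test instances with the lowest true runtimes, then average):

    import numpy as np

    def mape_with_threshold(t_true, t_pred, drop_frac=0.30):
        t_true = np.asarray(t_true, dtype=float)
        t_pred = np.asarray(t_pred, dtype=float)
        keep = np.argsort(t_true)[int(drop_frac * len(t_true)):]  # drop lowest 30%
        return 100.0 * np.mean(np.abs(t_true[keep] - t_pred[keep]) / t_true[keep])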
V. DEMONSTRATION OF VARIANT-SELECTION

As a crucial application of our performance prediction approach, we demonstrate that it can be used to pick the best variant for a given kernel, i.e., to pick the best available code among several options. Possible scenarios include choosing between a CPU and a GPU implementation and identifying the compilation flags best suited for the kernel. To show the variant-selection capability of our approach, we choose a scenario where the number of variants can be extremely high. Further, we choose two kernels different from the four discussed thus far to show the generalizability of our approach.

We consider the Blur kernel (Blur) and the Fast Fourier Transform (FFT) kernel implemented in Halide [7]. A Halide code decouples the functional program from its execution "schedule", which determines various aspects of the execution such as the ordering of the loops, the degree of loop unrolling, and the vectorization strategy. The schedule description can be considered as a combination of a shape (feature space) and parameters. For instance, blur_y.tile(x, y, xi, yi, 128, 256) defines two dimensions of the shape, and the parameters 128 and 256 determine the tunable parameters along these dimensions. Changing the schedule does not affect the output of the code, but it may significantly affect the runtime. Therefore, each schedule generates a variant of the same kernel, and our task is to identify the best variant to use.
A. Model Inputs
We train our compact augmented neural networks with inputs representing the schedule features. This allows us to quickly estimate runtimes of the code with various schedule parameters without actually executing the code. Halide provides an auto-scheduler [18] that attempts to identify the best schedule itself. We run the auto-scheduler to identify the shape/feature space and ignore the suggested parameters. Within this feature space we generate candidate schedules S = {s_1, s_2, s_3, ..., s_N}, where each s_i is a vector of parameters, and find s* = argmin_i P(s_i), where P(s_i) is the predicted runtime of schedule s_i. For kernels to which the Halide auto-scheduler is not applicable, we identify the feature space based on the provided manually written implementation. The input data dimension n and an augmented constant are also fed into the neural network: we augment n² for Blur and n log n for FFT in the corresponding variant-selection models, given the complexity of the Halide implementations of both kernels [19].
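A sketch of this selection step, assuming a trained NN+C predictor model whose feature vector is the schedule parameters followed by n and the augmented constant (function and variable names are ours):

    import math
    import numpy as np

    def select_schedule(model, schedules, n, kernel="blur"):
        # Augmented constant: n^2 for Blur, n*log(n) for FFT (per the text above).
        c = n * n if kernel == "blur" else n * math.log2(n)
        feats = np.array([list(s) + [n, c] for s in schedules], dtype=np.float32)
        predicted = model.predict(feats, verbose=0).ravel()   # predicted runtimes
        return schedules[int(np.argmin(predicted))]           # s* = argmin_i P(s_i)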
B. Platforms and Optimizations

We conducted variant-selection experiments on five platforms: Xeon, I7, I5, Tesla, and Quadro. We used Halide to implement the kernel operations. Further experimental settings for variant-selection can be found in our published code at https://github.com/Naifeng/Variant-Selection.
C. Datasets

We evaluated two kernels: Blur and FFT. The performance of Blur is measured on all five platforms. Given that there is no existing GPU schedule of FFT provided by Halide, we only conducted FFT experiments on the three CPU platforms. To demonstrate that our models are able to identify the best implementation among numerous existing variants, we generated thousands of data instances and restricted the training set to 250 instances to maintain portability.

The following is a piece of code from the implementation of Halide Blur on CPU. Each of s1, s2, s3, and s4 resides in a .split() call and serves as a split factor. The inner loop runs from zero to the split factor, and the outer loop runs from zero to the extent required by the first argument divided by the split factor [7]. Thus, a combination of {s1, s2, s3, s4} defines a candidate schedule, and different schedules have significantly different runtimes. We varied each parameter
extensively to generate a candidate set. The schedule given by the Halide auto-scheduler is {8, 256, 128, 8}.

    { // s1 is a tunable split factor
      Var x = blur_x.args()[0];
      blur_x.compute_at(blur_y, x_o)
            .split(x, x_vo, x_vi, s1)
            .vectorize(x_vi);
    }
    { // s2, s3, s4 are tunable split factors
      Var x = blur_y.args()[0];
      Var y = blur_y.args()[1];
      blur_y.compute_root()
            .split(x, x_o, x_i, s2)
            .split(y, y_o, y_i, s3)
            .reorder(x_i, y_i, x_o, y_o)
            .split(x_i, x_i_vo, x_i_vi, s4)
            .vectorize(x_i_vi)
            .parallel(y_o)
            .parallel(x_o);
    }

According to Halide implementation rules and current support (e.g., Halide only supports limited input dimensions for FFT), we varied the parameters as described in Table IV. For Blur and FFT on CPU, we generated 1000 data instances for each input data dimension, resulting in 6000 and 4000 instances, respectively. For Blur on GPU, we exhaustively generated all possible combinations, that is, 1176 instances.

Table IV: Parameters for data generation
Blur (CPU): n ∈ {…}; s1, s2 ∈ {…}; s3 ∈ {…, s1}; s4 ∈ {…, s2}
Blur (GPU): n ∈ {…}; s1 ∈ {…}; s2, s3 ∈ {…}
FFT (CPU): n ∈ {…}; s1 ∈ {…, n − 1}; s2, s3, s4, s5, s6 ∈ {…, n}
The augmented neural networks used for variant-selection are the same as the models described in Section IV-C, except that all variant-selection models have 3 dense layers. The number of parameters of each model as well as its average training time is shown in Table V.

Table V: Number of parameters and average training times
       Blur     FFT
CPU    71, 18s  67, 12s
GPU    66, 7s   N/A
E. Baselines
We compare our method against four baselines identical to those described in Section IV-D: (1) NN, (2) C, (3) LR+C, and (4) NLR+C. In addition, for Blur on CPU, we compare our variant-selection approach with the Halide auto-scheduler to show the overall improvement. For Blur on GPU, because Halide does not have a stable auto-scheduler for generating a GPU schedule, we compare our variant-selection results with the average runtime over all candidate schedules. Similarly, since the current Halide auto-scheduler is not capable of scheduling a complicated pipeline such as FFT, we compare our results with the average runtime over all candidate schedules for FFT on CPU.
F. Evaluation Metrics
We used MAPE and Spearman's rank correlation coefficient (ρ) to evaluate the predictions {t̂_1, t̂_2, ..., t̂_N} obtained by the baselines and our models w.r.t. the ground truth {t_1, t_2, ..., t_N}. MAPE is defined in Equation (1). ρ is defined as

ρ = 1 − (6 Σ_{i=1}^{N} d_i²) / (N(N² − 1)),   (2)

where d_i = |rank(t_i) − rank(t̂_i)|. rank(t_i) is the rank of t_i among {t_1, t_2, ..., t_N}, ranking from the lowest value to the highest, and rank(t̂_i) is the rank of t̂_i among {t̂_1, t̂_2, ..., t̂_N}, likewise from lowest to highest. ρ ranges from −1 to 1: a ρ of 1 indicates a perfect positive correlation of the two variables' ranks, and a ρ of −1 indicates a perfect negative correlation. The closer ρ is to zero, the weaker the correlation between the ranks.
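Equation (2) can be computed directly on the two runtime lists, e.g. as follows (assuming no ties, as in the formula):

    import numpy as np

    def spearman_rho(t_true, t_pred):
        rank_t = np.argsort(np.argsort(t_true))   # ranks of true runtimes
        rank_p = np.argsort(np.argsort(t_pred))   # ranks of predicted runtimes
        d = rank_t - rank_p
        n = len(t_true)
        return 1.0 - 6.0 * np.sum(d ** 2) / (n * (n ** 2 - 1))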
VI. RESULTS

A. Kernel Performance Prediction

Figure 2: Performance predictions of four kernels using NN+C.

Figure 2 shows a visualization of using NN+C to predict kernel performance on two platforms. We choose the results on I5 and Tesla to represent the results on CPU and GPU, respectively. We plot matrix dimension on the x-axis against execution time in seconds to visualize the predictions; a line joining two points in the plot connects the corresponding prediction and ground truth. Note that very few points have a significant misprediction. Tables VI, VII, VIII, and IX quantify these results using MAPE.

For all four kernels using any implementation, NN+C produces the lowest MAPE in predictions. Ranking from the highest accuracy (lowest MAPE) on average to the lowest: (1) NN+C, (2) NLR+C, (3) NN, (4) LR+C, and (5) C. On average, NN+C outperforms the traditional NN by a margin of 8% and outperforms the second-best approach, NLR+C, by 5%. LR+C has good prediction accuracy for kernels on GPU. The performance of C is the worst among the approaches on all platforms for every kernel except MV.

Table VI: Prediction MAPE of Matrix-Matrix Multiplication
         CPU: Eigen           CPU: Boost           GPU: CUDA Global Memory   GPU: CUDA Shared Memory
         Xeon   I7    I5      Xeon   I7    I5      Tesla    Quadro           Tesla    Quadro
NN+C     14%    23%   8%      7%     27%   6%      7%       5%               8%       8%
NN       29%    31%   26%     20%    35%   19%     23%      13%              18%      16%
C        39%    34%   28%     8%     34%   7%      9%       9%               10%      10%
LR+C     44%    31%   33%     8%     34%   7%      8%       8%               8%       8%
NLR+C    23%    24%   …       9%     33%   7%      10%      10%              18%      19%
Table VII: Prediction MAPE of Matrix-Vector Multiplication
         CPU: Eigen           CPU: Boost           GPU: CUDA Global Memory   GPU: CUDA Shared Memory
         Xeon   I7    I5      Xeon   I7    I5      Tesla    Quadro           Tesla    Quadro
NN+C     21%    21%   25%     11%    8%    9%      7%       7%               7%       6%
NN       22%    24%   29%     14%    11%   12%     8%       9%               23%      23%
C        11%    10%   …       …      …     …       …        …                …        …
LR+C     22%    26%   …       12%    8%    9%      7%       7%               7%       6%
NLR+C    26%    25%   27%     12%    8%    9%      29%      28%              21%      22%
Table VIII: Prediction MAPE of Matrix Convolution
         CPU: Eigen           CPU: Boost           GPU: CUDA Global Memory   GPU: CUDA Shared Memory
         Xeon   I7    I5      Xeon   I7    I5      Tesla    Quadro           Tesla    Quadro
NN+C     8%     21%   4%      30%    20%   13%     10%      15%              17%      19%
NN       9%     22%   7%      50%    30%   30%     16%      …                …        …
C        27%    40%   22%     48%    44%   40%     30%      30%              42%      42%
LR+C     15%    32%   13%     46%    38%   37%     15%      …                29%      30%
NLR+C    18%    32%   7%      …      32%   24%     17%      17%              21%      21%
Table IX: Prediction MAPE of Max-Pooling
         CPU: Eigen           CPU: Boost           GPU: CUDA Global Memory   GPU: CUDA Shared Memory
         Xeon   I7    I5      Xeon   I7    I5      Tesla    Quadro           Tesla    Quadro
NN+C     23%    13%   14%     27%    12%   14%     14%      8%               25%      27%
NN       32%    20%   22%     36%    20%   34%     32%      32%              40%      47%
C        67%    37%   43%     81%    41%   47%     93%      95%              40%      28%
LR+C     50%    26%   27%     63%    31%   33%     75%      77%              40%      28%
NLR+C    25%    13%   18%     14%    31%   29%     …        …                …        …
Overall, NN+C predicts kernels on GPU more accurately than kernels on CPU, achieving an average low MAPE of 12% and 16%, respectively.

We report the average error in MAPE for the four kernels and the two hardware classes (CPU, GPU) in Table X. For each kernel, MAPE is aggregated over all hardware and variants; for each hardware class, MAPE is aggregated over all kernels, variants, and specific devices. We show the comparison of NN+C against the traditional NN. NN+C significantly outperforms NN in almost all cases. In fact, for MM, the MAPE of NN+C is less than half that of NN, suggesting that the traditional neural network is far inferior to our augmented neural network.

Table X: Aggregated MAPE of NN+C vs. NN
        MM    MV    MC    MP    CPU   GPU
NN+C    11%   12%   16%   18%   16%   12%
NN      23%   14%   22%   32%   24%   20%

a) Unconstrained Augmented Neural Networks:
To enable fast inference, our models are kept extremely lightweight, with fewer than 75 weights. Also, we only generate 500 data instances for each kernel-variant-hardware combination, out of which 250 are used to train our models. In order to assess how much performance is compromised by these restrictions, we build similar NN+C models with more parameters and generate a larger dataset with 5000 data instances (2500 used to train and 2500 to test). Figure 3 illustrates the comparison between the lightweight models and the unconstrained models in terms of error. Overall, the MAPE achieved by lightweight NN+C is 14% and by unconstrained NN+C is 9%. Specifically, on CPU, using unconstrained NN+C, MM, MV, MC, and MP see a decrease in average MAPE of 5%, 2%, 8%, and 12%, respectively. On GPU, using unconstrained NN+C, MM, MV, MC, and MP see a decrease in average MAPE of 1.5%, 1%, 3%, and 2.5%, respectively. However, accuracy comes at the cost of increased model size and the
overall time, as summarized in Table XI.

Figure 3: Performance comparison between lightweight models and unconstrained models (preparing time in seconds vs. MAPE (%) for Xeon, I7, I5, Tesla, and Quadro on MM, MV, MC, and MP).

Table XI: Preparing time increase and model size increase
       MM            MV            MC             MP
CPU    9.31x, 2.13x  2.30x, 2.12x  11.59x, 2.21x  10.24x, 2.48x
GPU    5.08x, 8.80x  7.07x, 2.34x  4.28x, 2.12x   3.35x, 2.62x
The average preparing time (training data generation time plus model training time) of lightweight NN+C on CPU is 104s, whereas that of unconstrained NN+C is 1040s, 10x that of lightweight NN+C. The average preparing time of lightweight NN+C on GPU is 20s, and that of unconstrained NN+C is 97s, 4.85x that of lightweight NN+C. In addition to the shorter preparing time, the model size is significantly reduced. The model size reduction is most evident for MM on GPU: compared to unconstrained NN+C, the lightweight NN+C model is 8.80x smaller. The rest of the models are downsized by 2.29x on average.

If the training and inference time is not constrained, one can use our unconstrained (larger and more accurate) augmented models. However, we envision that lightweight models may be necessary because, at compile-time, many kernels need to be evaluated: consider VGG16 inference, which requires >1M 2D-convolutions.

B. Variant-Selection
Figure 4 shows the comparison of execution times for varying input sizes of the two kernels. Our predicted best schedule produces a runtime close to the true best schedule within the candidate set in all cases. Further, Figure 4(a) shows that, using our predictions, we were able to outperform Halide's auto-scheduler, getting up to 1.7x speedup for the Blur kernel on CPU. For Blur on GPU, we obtained our largest speedup over a randomly selected schedule on a small input size (see Figure 4(b)), and we obtained a speedup on every input size. As shown in Figure 4(c), we also obtained a speedup compared to a randomly selected schedule for Halide FFT on CPU.

Table XII: Prediction MAPE and Spearman's coefficient

         Halide Blur (CPU)                      Halide Blur (GPU)         Halide FFT (CPU)
         Xeon         I7          I5            Tesla       Quadro        Xeon        I7          I5
NN+C     …, 0.91      …           …             …           …             …           …           …
NN       72%, 0.87    25%, 0.94   40%, 0.91     12%, 0.98   29%, 0.94     11%, 0.98   19%, 0.95   17%, 0.95
C        1140%, 0.76  59%, 0.82   167%, 0.93    84%, 0.73   64%, 0.85     66%, 0.86   33%, 0.97   30%, 0.97
LR+C     1687%, 0.66  93%, 0.82   411%, 0.84    106%, 0.89  62%, 0.87     44%, 0.97   32%, 0.92   26%, 0.90
NLR+C    150%, …      …           …             …           …             …           …           …

Note that MAPE varies among different train-test splits and training processes. The MAPE values shown in Table XII are the best (lowest) MAPE obtained by our method and the baselines on the Halide kernels. The table also shows Spearman's rank correlation coefficient. Since the main objective is to select the best variant, which requires the ability to correctly rank the variants, this is the primary metric of comparison. We observe that our approach NN+C obtains the highest rank correlation in the majority of the cases. NLR+C has a higher rank correlation for Halide Blur while having a much worse MAPE. On the other hand, for Halide FFT, NLR+C obtains rank coefficients identical to NN+C and better MAPEs. However, NLR+C has a much higher inference time and is therefore likely to hinder the execution of the actual application if used by the runtime component of a compiler, as argued at the end of Section VI-A. Figure 5 shows the comparison of inference times between our lightweight NN+C and NLR+C; the inference time of NLR+C is substantially higher than that of our approach.

Figure 5: Inference time comparison of NN+C and NLR+C

VII. CONCLUSION
We have proposed a novel lightweight augmented neural network (NN+C) to predict kernel performance on CPUs and GPUs. Our approach is designed to support the creation of compilers with high productivity, portability, and performance. To show that our models are portable to different platforms with different implementations, we have evaluated our models on several CPUs and GPUs with multiple optimizations, resulting in a total of 48 kernel-variant-hardware combinations. Our models significantly outperformed the baselines, including a standard neural network. We have shown that our approach can be used to identify the best variants even when the number of variants is extremely high, demonstrating a 1.7x speedup over the Halide auto-scheduler. In future work, we will build prediction models for other popular kernels. These models will be used to perform optimized mapping of kernels in workflows onto various heterogeneous platforms.

ACKNOWLEDGEMENT
This work is supported by the Defense Advanced Research Projects Agency (DARPA) under BAA number HR0011-20-9-0019 and by the National Science Foundation Award number 1911229. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the sponsors.
REFERENCES

[1] M. M. Vai, W. S. Song, and B. M. Tyrrell, "Application-specific integrated circuits," in High Performance Embedded Computing Handbook: A Systems Perspective (D. R. Martinez, R. A. Bond, and M. M. Vai, eds.), ch. 9, pp. 191–215, 2008.
[2] I. Kuon and J. Rose, Quantifying and Exploring the Gap Between FPGAs and ASICs. Springer Publishing Company, Incorporated, 1st ed., 2009.
[3] B. Zahiri, "Structured ASICs: opportunities and challenges," in Proceedings 21st International Conference on Computer Design, 2003.
[7] J. Ragan-Kelley, C. Barnes, A. Adams, S. Paris, F. Durand, and S. Amarasinghe, "Halide: a language and compiler for optimizing parallelism, locality, and recomputation in image processing pipelines," ACM SIGPLAN Notices, vol. 48, no. 6, pp. 519–530, 2013.
[8] L. Huang, J. Jia, B. Yu, B.-G. Chun, P. Maniatis, and M. Naik, "Predicting execution time of computer programs using sparse polynomial regression," in Advances in Neural Information Processing Systems 23 (J. D. Lafferty, C. K. I. Williams, J. Shawe-Taylor, R. S. Zemel, and A. Culotta, eds.), pp. 883–891, Curran Associates, Inc., 2010.
[9] E. Ipek, B. R. De Supinski, M. Schulz, and S. A. McKee, "An approach to performance prediction for parallel applications," in European Conference on Parallel Processing, pp. 196–205, Springer, 2005.
[10] W. Smith, I. Foster, and V. Taylor, "Predicting application run times using historical information," in Workshop on Job Scheduling Strategies for Parallel Processing, pp. 122–142, Springer, 1998.
[11] C. Mendis, A. Renda, S. Amarasinghe, and M. Carbin, "Ithemal: Accurate, portable and fast basic block throughput estimation using deep neural networks," arXiv preprint arXiv:1808.07412, 2018.
[12] E. Konstantinidis and Y. Cotronis, "A quantitative roofline model for GPU kernel performance estimation using micro-benchmarks and hardware metric profiling," Journal of Parallel and Distributed Computing, vol. 107, pp. 37–56, 2017.
[13] G. Wu, J. L. Greathouse, A. Lyashevsky, N. Jayasena, and D. Chiou, "GPGPU performance and power estimation using machine learning," in 2015 IEEE 21st International Symposium on High Performance Computer Architecture (HPCA), pp. 564–576, IEEE, 2015.
[14] D. P. Kingma and J. Ba, "Adam: A method for stochastic optimization," arXiv preprint arXiv:1412.6980, 2014.
[15] L. Breiman, "Random forests," Machine Learning, vol. 45, no. 1, pp. 5–32, 2001.
[16] C. Dahinden and M. Ethz, "An improved random forests approach with application to the performance prediction challenge datasets," Hands-on Pattern Recognition, Challenges in Machine Learning, vol. 1, pp. 223–230, 2011.
[17] F. Hutter, L. Xu, H. H. Hoos, and K. Leyton-Brown, "Algorithm runtime prediction: Methods & evaluation," Artificial Intelligence, vol. 206, pp. 79–111, 2014.
[18] R. T. Mullapudi, A. Adams, D. Sharlet, J. Ragan-Kelley, and K. Fatahalian, "Automatically scheduling Halide image processing pipelines," ACM Transactions on Graphics (TOG), vol. 35, no. 4, pp. 1–11, 2016.
[19] K. Li and P. Hudak, "Memory coherence in shared virtual memory systems," ACM Transactions on Computer Systems, vol. 7, no. 4, pp. 321–359, 1989.