Mirovia: A Benchmarking Suite for Modern Heterogeneous Computing
Bodun Hu
University of Texas at Austin
Austin, TX
Christopher J. Rossbach
University of Texas at Austin
Austin, TX
Abstract—This paper presents Mirovia, a benchmark suite developed for modern-day heterogeneous computing. Previous benchmark suites such as Rodinia [1] and SHOC [2] are well written and have many desirable features. However, these tools were developed years ago when hardware was less powerful and software had fewer features. For example, unified memory was introduced in CUDA 6 as a new programming model and wasn't available when Rodinia was released. Meanwhile, the increasing demand for graphics processing units (GPUs) due to the recent rise in popularity of deep neural networks (DNNs) has opened doors for many new research problems. It is essential to consider DNNs as first-class citizens in a comprehensive benchmark suite. However, the main focus of existing DNN frameworks is usually limited to inference and model performance evaluation, which is not desirable for hardware architects studying emerging platforms. Drawing inspiration from Rodinia and SHOC, Mirovia is a benchmark suite that is designed to take advantage of modern GPU architectures, while also representing a diverse set of application domains. By adopting applications from Rodinia and SHOC, and including newly written applications with a special focus on DNNs, Mirovia better characterizes modern heterogeneous systems.
I. INTRODUCTION
GPUs have become popular for accelerating computing in industry in recent years. Traditionally, these computational devices were used primarily for 3D rendering. However, their use as General-Purpose Graphics Processing Units (GPGPUs) has expanded rapidly since then. Heterogeneous computing, using both CPUs and GPUs, is becoming more and more powerful, and this trend is likely to continue. These systems are also becoming increasingly easier to program due to the development of languages such as CUDA. Because of the high demand for heterogeneous systems, these tools are seeing a rise in popularity in the programming community.

Existing benchmark suites such as Rodinia and SHOC were designed to better understand the characteristics of heterogeneous systems. A set of applications was curated so that each benchmark exhibits unique behaviors that stress a characteristic or component of GPUs. Doing so allows programmers to select the most suitable hardware and software for specific tasks. However, these benchmark suites haven't kept up with the evolution of programming frameworks: they don't utilize newer features, like HyperQ, that were introduced in recent versions of CUDA. Further, while hardware has evolved to provide more raw compute, the task of scaling the input to benchmarks has either been left to the user, as is the case with Rodinia, or is entirely untenable, as in SHOC. This makes the applications unlikely to stay relevant as problem sizes grow larger.

GPUs are the de-facto platform for training deep neural networks. Frameworks like Tensorflow [3] and Pytorch [4] are widely adopted in the neural network community. However, they are not designed to study hardware behaviors and usually introduce excessive memory footprints, making them less desirable for hardware architects. SHOC does include a neuralnet benchmark, but it fails to capture the complexity of recent DNN models.

This paper makes several key contributions:
• Present the areas in which the existing benchmark suites Rodinia and SHOC are lacking, specifically in workload diversity, problem sizes, and programming features.
• Demonstrate the improvements made in Mirovia and how these improvements characterize the capabilities of modern heterogeneous systems more comprehensively. Mirovia adds new features supported up to CUDA 9.0.
• Understand the characteristics of neural networks as a new application domain and compare them to conventional GPU applications by adding commonly used DNN kernels to Mirovia.
II. RELATED WORK
A. Rodinia
Rodinia is a suite of applications designed for heterogeneous systems, released in 2009. It consists of applications and kernels representing different types of behavior of the Berkeley dwarfs [5]. The dwarfs represent 13 major categories of computation used to describe most types of problems. In addition, Rodinia covers a diverse set of applications covering communication, synchronization, and power consumption. However, Rodinia doesn't utilize new features like unified memory, HyperQ, Cooperative Groups, and Dynamic Parallelism. The benchmark suite still targets CUDA 4.0. Therefore, it will not utilize programming constructs or performance improvements introduced in newer CUDA versions, such as the CUDA event library. Moreover, Rodinia doesn't support hardware capable of performing half-precision operations and tensor core computation.

To analyze how Rodinia stresses characteristics of GPUs, performance was profiled for each benchmark using nvprof. These metrics quantify the utilization rate on a scale of 0 to 10. The utilization rate for each component indicates how much time was spent on that component relative to the total execution time: 0 means idle, while 10 means full utilization. Because many applications run multiple kernels, the maximum utilization rate from any component is selected for each kernel in the benchmark.

Figure 1 shows the average utilization rate and its standard variance for the different functional units and the memory hierarchy for each application in Rodinia. Since Rodinia does not come with default run parameters but rather a sample run command, the parameters in the sample command were used for each application. These sample files can be found in the directory of each benchmark.

One observation is that the utilization of many components is relatively small even though many applications in Rodinia consist of multiple kernels. For example, the average GPU component utilization looks very similar for benchmarks such as gaussian, huffman, nw, and myocyte. Many hardware components are also not fully stressed to achieve maximum utilization. This is because Rodinia consists of higher-level applications, instead of microbenchmarks targeting specific components as SHOC does. The majority of kernels in each benchmark are not designed to target a specific GPU component.

B. SHOC
Developed in 2010, SHOC is a suite of benchmarks for heterogeneous computing platforms. Unlike Rodinia, SHOC is divided into two primary categories: stress tests and performance tests. The stress tests use computationally demanding kernels to identify devices with bad memory, insufficient cooling, and other device component related issues. Performance tests, on the other hand, concentrate on synthetic kernels and common parallel algorithms. All SHOC applications run within a unified framework which accepts user-specified testing parameters such as the number of iterations to run. Detailed metrics, including floating point operations per second (FLOPS), can also be recorded. These features are very useful for evaluating performance. However, many programs in SHOC are just basic parallel algorithms, which may only be a subset of the routines used in more common and more complex applications. Even though SHOC covers a variety of dwarfs, it doesn't represent the complexity present in real-world applications. Similar to Rodinia, SHOC was written at a time when newer software and hardware features weren't available.

Figure 2 shows the GPU resource utilization and standard variance for each application in SHOC. All metrics are collected using the largest preset data size available. Unlike Rodinia, the utilization rate no longer exhibits a fixed pattern but varies over a diverse range. This is because SHOC consists of microbenchmarks that target specific hardware components. However, most components are not fully stressed to maximum capacity.
C. Other Benchmarks
More recent benchmarks [6] [7] [8] have addressed irregular parallelism, rather than stressing overall heterogeneous performance. Further, not all new CUDA features have been taken into consideration. Sinclair, Alsop, and Adve [9] focused on benchmarking the effects of different levels of synchronization (from coarse to fine). Their work was more focused on a benchmark suite that evaluates synchronization through data sharing (e.g. unified memory and coherent caches). The MAFIA framework [10] was designed to target multi-application execution on GPUs.

Due to the rapid growth in popularity of machine learning, there has been significant focus on characterizing DNN behavior on GPUs [11]–[13]. Popular frameworks such as Tensorflow include primitive tools for users to analyze the computational demands of their models. Tango [14] is a framework to study the behaviors of specific neural network models. DNNMark [15] is a framework to study various kernels used in DNNs.
III. MOTIVATION
The previous section introduced Rodinia and SHOC, analyzing the pros and cons of each. Here we motivate the creation of Mirovia and outline the key areas in which existing benchmark suites can be improved.
A. GPU Application Domain
GPUs have come to be used in many new domains in recent years that are not represented by workloads in Rodinia or SHOC, e.g., data analytics. Relational algebra and SQL statements are not represented in either Rodinia or SHOC. Similarly, while GPUs have become the standard for DNN model training, these benchmark suites do not include neural network based kernels. Even though GPUs were designed to maximize throughput, the recent introduction of specialized hardware like TPUs [16] suggests that there may be potential for improvement in GPUs. It is necessary to study the behavior of individual kernels to find potential performance improvements.
B. Better Dataset Sizes
One of the most obvious aspects of existing benchmark suites to be updated is dataset sizes. In SHOC, there are 4 preset data sizes. This lack of flexibility makes it hard for SHOC to stay relevant in the future, as advancing technology will eventually cause even the largest data size to be too small to stress GPU resources. Rodinia has the exact opposite problem, where benchmarks have no preset data size and the user must specify their own problem sizes. Users have to run data generation scripts even if they do not know what input size may be appropriate for the system they are benchmarking. Benchmarking with improperly sized input data throws the validity of the benchmarks into question, particularly when it is quite well established that memory coalescing can be used to great effect on GPUs [17].
Fig. 1. GPU resource utilization by Rodinia benchmarks. 0 indicates no utilization; 10 indicates full utilization.
Fig. 2. GPU resource utilization by SHOC benchmarks. 0 indicates no utilization; 10 indicates full utilization.
C. Support for Recent CUDA Versions
In addition to general performance improvements, each new version of CUDA typically introduces new programming constructs that can be used to write more efficient code. The official Rodinia benchmark suite relies only on features that were available in CUDA 4.0, and thus doesn't take advantage of the newer CUDA constructs. It is essential to include these new features to understand their impact on performance.

IV. THE MIROVIA BENCHMARK SUITE
In Mirovia, like SHOC, benchmarks are divided into levels. The levels range from benchmarks characterizing low-level hardware characteristics, such as memory bandwidth, up to real-world application performance. While determining the set of benchmarks for Mirovia, consideration was given to both the Berkeley dwarfs and application domains. Table I shows the benchmarks included in Mirovia and their respective primitives and application domains.

TABLE I
MIROVIA BENCHMARKS, THEIR RESPECTIVE PRIMITIVES AND APPLICATION DOMAINS
(Columns: Level, Benchmark, Dwarf, Application Domain, New CUDA Feature)

The Mirovia suite contains the following features:
• A new set of benchmarks representing neural network layers commonly used in popular DNN models. This section consists of 15 types of layers and 1 real-world DNN model. They have been parallelized with CUDA APIs and powered by libraries including cuBLAS and cuDNN (NVIDIA CUDA Deep Neural Network library). We decided to include a neural network model because individual layers are not sufficient in terms of representing real-world application workflows. We use Darknet [18], an open-source neural network framework, to construct neural networks. However, most of the kernels in Darknet don't utilize the cuDNN library, which causes a degradation in performance compared to industrial standards like Tensorflow. Thus, we reimplemented the most commonly used kernels with the cuDNN library and removed extra memory operations to reduce memory footprints. We believe these neural network applications will enrich the benchmark diversity in Mirovia.
• Mirovia aims to strike a balance between the predetermined input sizes available in SHOC and the customizable input sizes available in Rodinia. Benchmarks contain preset sizes optimized for systems with different compute capabilities, as well as a mechanism through which users can specify the size and other aspects of their input. This feature merges the favorable qualities of both Rodinia and SHOC.
• Several benchmarks have been updated to utilize the most recent release of CUDA. For each feature, one benchmark was chosen to test it. These features include:
  – HyperQ: allows multiple independent CUDA kernels to execute in parallel on the same GPU if the resources are available. HyperQ uses 32 independent Work Distributor Queues to detect opportunities for parallelism, whereas older architectures use a single Work Distributor Queue. This feature is implemented in Pathfinder.
  – Dynamic Parallelism: enables currently executing CUDA kernels to call child CUDA kernels (nested parallelism). This feature is useful when running algorithms with hierarchical data structures and recursive algorithms with parallelism at each level. This feature is added to Mandelbrot.
  – Unified Memory: a programming construct that gives the programmer the illusion that the host and the device share an address space. It establishes a single memory address space visible to all processors in the system. When an application accesses data currently absent from the running device, the hardware automatically pages in the data needed by the processor. This feature is implemented in BFS.
  – Cooperative Groups (Grid Sync): provides another granularity of synchronization for kernel threads running on a GPU. Grid sync allows users to synchronize all threads in the entire grid before beginning the next section of computation. This is useful for programs with disjoint phases of computation running right after one another, and it takes synchronization granularity to the next level, as previous CUDA versions only support __syncthreads(), which synchronizes all threads in a single block. This feature is implemented in SRAD.
  – CUDA Event API: allows accurate timing of CUDA functions and kernel calls (a minimal timing sketch is shown after this list). This is an improvement over Rodinia, which still uses system time.
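As an illustration, the following is a minimal sketch of timing a kernel with the CUDA Event API. The kernel and sizes here are placeholders for exposition, not Mirovia's actual code.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel used only to have something to time.
__global__ void dummyKernel(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = data[i] * 2.0f + 1.0f;
}

int main() {
    const int n = 1 << 20;
    float *d_data;
    cudaMalloc(&d_data, n * sizeof(float));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    // Events are recorded on the same stream as the kernel, so the elapsed
    // time covers exactly the device-side execution, unlike host timers.
    cudaEventRecord(start);
    dummyKernel<<<(n + 255) / 256, 256>>>(d_data, n);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("kernel time: %.3f ms\n", ms);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_data);
    return 0;
}
```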
A. Workloads
1) Level 0: Level 0 benchmarks are designed to measure low-level characteristics of the hardware. These benchmarks perform the simple task of measuring a single capability of the GPU and therefore don't represent any dwarfs or application domains.
BusSpeedDownload measures the speed of the PCI bus by repeatedly transferring data of various sizes from the host to the device. The data sizes vary from 1 KB to 500 KB.
BusSpeedReadback measures the speed of the PCI bus in the opposite direction: data is transferred from the device to the host.
DeviceMemory measures the bandwidth of different components of the memory hierarchy on the device, including global, constant, and shared memory.
MaxFlops (Half Precision) measures the maximum achievable floating point operations per second on the device. In SHOC, this benchmark runs tests using single and double precision. The half-precision test will only work on architectures that support half-precision floating point arithmetic.
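A simplified sketch of the kind of half-precision FMA loop such a benchmark might time is shown below. This is not Mirovia's actual kernel; it assumes a GPU with compute capability 5.3 or higher and the cuda_fp16.h intrinsics.

```cuda
#include <cuda_fp16.h>

// Each thread repeatedly issues packed half2 fused multiply-adds, so the
// measured time is dominated by arithmetic throughput, not memory traffic.
__global__ void maxFlopsHalf(__half2 *out, int iters) {
    __half2 a = __floats2half2_rn(1.001f, 0.999f);
    __half2 b = __floats2half2_rn(0.5f, 0.5f);
    __half2 c = __floats2half2_rn(0.0f, 0.0f);
    for (int i = 0; i < iters; ++i) {
        c = __hfma2(a, c, b);   // a*c + b: 2 FLOPs per lane, 4 per instruction
    }
    // Write the result so the compiler cannot eliminate the loop.
    out[blockIdx.x * blockDim.x + threadIdx.x] = c;
}
```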
2) Level 1:
Level 1 benchmarks include basic parallel algorithms which are common tasks in parallel computing and often used in kernels of real applications. While these applications represent a subset of the Berkeley dwarfs, they are complex enough to represent real application domains.
GUPS (Random Memory Access) stands for giga-updates per second. It measures how frequently a computer can issue updates to randomly generated RAM locations. This benchmark stresses the latency and bandwidth of the device. The test is important because random memory performance directly maps to application performance.
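A minimal sketch of the style of random-update kernel GUPS measures is shown below; the index generator here is a simple LCG chosen for illustration, and the real benchmark's update rule may differ.

```cuda
// Each thread performs XOR updates at pseudo-random locations in a large
// table, stressing memory latency rather than streaming bandwidth.
__global__ void randomUpdates(unsigned long long *table, size_t tableSize,
                              int updatesPerThread) {
    unsigned long long state = blockIdx.x * blockDim.x + threadIdx.x + 1ULL;
    for (int i = 0; i < updatesPerThread; ++i) {
        state = state * 6364136223846793005ULL + 1442695040888963407ULL;  // LCG step
        size_t idx = state % tableSize;
        table[idx] ^= state;   // data-dependent, uncoalesced access
    }
}
```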
Breadth First Search (Unified Memory) runs and measures the performance of breadth-first search, a common graph traversal algorithm. This application was included because it is control-flow intensive. This benchmark was also chosen to test the unified memory feature in CUDA.
General Matrix Multiply is an application that measures the performance of different types of matrix multiplications. The types include single and double precision tests, with and without transposing the input matrices.
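The benchmark relies on BLAS-style matrix multiply; a minimal cuBLAS call for the single-precision, non-transposed case looks roughly like the sketch below. The function name, dimensions, and device pointers are placeholders, not Mirovia's code.

```cuda
#include <cublas_v2.h>

// Computes C = alpha * A * B + beta * C for column-major inputs,
// where A is m x k, B is k x n, and C is m x n.
void sgemmExample(cublasHandle_t handle, const float *dA, const float *dB,
                  float *dC, int m, int n, int k) {
    const float alpha = 1.0f, beta = 0.0f;
    cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                m, n, k,
                &alpha, dA, m,   // lda = m
                dB, k,           // ldb = k
                &beta, dC, m);   // ldc = m
}
```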
Pathfinder (HyperQ) is an application that runs a shortest-path algorithm which serves as a test of irregular parallelism. While most conventional parallel algorithms have uniform behavior across the different threads, irregular algorithms are characterized by different threads performing different executions. Depending on graph connectivity, different threads can experience unique behaviors. In addition, Pathfinder experiences much higher control flow unit utilization compared to regular parallel algorithms, as each thread needs to decide how to execute independently. Therefore, we decided to exercise HyperQ in this test.
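A minimal sketch of how independent kernel instances can be placed on separate streams so HyperQ can overlap them is shown below; the dummy kernel and buffer setup are placeholders rather than the Pathfinder code.

```cuda
#include <cuda_runtime.h>

__global__ void independentWork(float *buf, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) buf[i] = sqrtf(buf[i]) + 1.0f;
}

int main() {
    const int numStreams = 32, n = 1 << 20;
    cudaStream_t streams[numStreams];
    float *buffers[numStreams];
    for (int i = 0; i < numStreams; ++i) {
        cudaStreamCreate(&streams[i]);
        cudaMalloc(&buffers[i], n * sizeof(float));
    }
    // With HyperQ, independent launches on distinct streams can be scheduled
    // concurrently as long as SM resources are available.
    for (int i = 0; i < numStreams; ++i)
        independentWork<<<(n + 255) / 256, 256, 0, streams[i]>>>(buffers[i], n);
    for (int i = 0; i < numStreams; ++i) {
        cudaStreamSynchronize(streams[i]);
        cudaFree(buffers[i]);
        cudaStreamDestroy(streams[i]);
    }
    return 0;
}
```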
Sort is an application that runs a fast radix sort [19] on an array of integers. It operates on key-value pairs of single precision floating point data.
3) Level 2:
Level 2 benchmarks are real-world application kernels. Benchmarks in this level are applications that can be found in industry, and therefore represent a variety of GPU application domains. These applications exhibit a diverse set of performance characteristics.
CFD Solver is a computational fluid dynamics benchmark. This application solves the three-dimensional Euler equations for compressible flow. The workload optimizes effective GPU memory bandwidth by reducing total global memory accesses and overlapping computation.
GPUDWT implements the discrete wavelet transform, an image and video compression algorithm that is also a popularly used digital signal processing technique. This benchmark implements both forward and reverse transforms, as well as the 9/7 and 5/3 variants. The 9/7 transform uses floats while the 5/3 transform uses integers, so it is important to measure the performance of both.
KMeans is a popular clustering algorithm used in data mining that shows a high degree of data parallelism. At the beginning, K centers are chosen. In each iteration, each data point is assigned to its nearest center, and at the end of the iteration each center is recomputed as the mean of all the data points in its cluster; this repeats until the assignments converge.
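A simplified sketch of the per-point assignment step that dominates the data-parallel work is shown below, using 1-D points for brevity; the actual benchmark operates on multi-dimensional features.

```cuda
#include <cfloat>

// Assign each point to its nearest center (1-D features for brevity).
__global__ void assignClusters(const float *points, const float *centers,
                               int *assignments, int numPoints, int k) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= numPoints) return;
    float best = FLT_MAX;
    int bestCenter = 0;
    for (int c = 0; c < k; ++c) {
        float d = points[i] - centers[c];
        d *= d;
        if (d < best) { best = d; bestCenter = c; }
    }
    assignments[i] = bestCenter;
}
```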
LavaMD calculates N-body particle interaction. The code calculates particle potential and relocation due to mutual forces between particles within a large 3D space. This space is divided into cubes, or large boxes, that are allocated to individual cluster nodes. The large box at each node is further divided into cubes, called boxes. 26 neighbor boxes surround each box (the home box). Home boxes at the boundaries of the particle space have fewer neighbors. Particles only interact with other particles that are within a cutoff radius, since ones at larger distances exert negligible forces. Thus the box size is chosen so that the cutoff radius does not span beyond any neighbor box for any particle in a home box, limiting the reference space to a finite number of boxes.
Mandelbrot (Dynamic Parallelism) computes an image of a Mandelbrot fractal, a self-repeating geometric pattern that loops back on itself at ever decreasing sizes. A commonly used algorithm is the Escape Time algorithm, which calculates the value for different pixels on a per-pixel basis. This benchmark was added specifically to test Dynamic Parallelism, a feature added to CUDA in version 5.0. With Dynamic Parallelism, the benchmark switches to using the Mariani-Silver algorithm. Unlike Escape Time, this procedure starts out coarse grained and only iterates at a finer resolution if necessary for certain subsections.
Needleman-Wunsch is a nonlinear global optimization method for DNA sequence alignments. The potential pairs of sequences are organized in a 2D matrix. In the first step, the algorithm fills the matrix from top left to bottom right, step by step. The optimum alignment is the pathway through the array with maximum score, where the score is the value of the maximum weighted path ending at that cell. Thus, the value of each data element depends on the values of its northwest-, north-, and west-adjacent elements. In the second step, the maximum path is traced backward to deduce the optimal alignment.
ParticleFilter is a statistical estimator of the location of a target object given noisy measurements of that target's location and an idea of the object's path in a Bayesian framework. The particle filter has a plethora of applications ranging from video surveillance, in the form of tracking vehicles, cells, and faces, to video compression. This particular implementation is optimized for tracking cells, particularly leukocytes and myocardial cells.
SRAD (Cooperative Groups) is a computer vision application used for reducing noise, or speckles, in images without destroying important image features. This is done using partial differential equations. Since each stage of this application operates on the entire image, SRAD requires synchronization after each stage. This makes SRAD the ideal benchmark to test the performance of using cooperative groups in CUDA.
Where is a new relational algebra benchmark. GPUs are becoming increasingly popular for data analytics because relational algebra operations are easy to parallelize. This benchmark acts like a filter for a set of records, returning the subset of the input records that meet a set of conditions. It first maps each entry to a 1 or 0, before running a prefix sum and using both of these auxiliary data structures to reduce the input data to just the matching entries.
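A Thrust-based sketch of the map / prefix-sum / scatter pattern described above is shown below. The predicate, record type, and function name are placeholders, not the benchmark's actual implementation.

```cuda
#include <thrust/device_vector.h>
#include <thrust/transform.h>
#include <thrust/scan.h>
#include <thrust/scatter.h>

struct Passes {                     // placeholder predicate: keep records > 50
    __host__ __device__ int operator()(int rec) const { return rec > 50 ? 1 : 0; }
};

// Returns the matching records, compacted, following the map -> prefix sum ->
// scatter pattern used by the Where benchmark.
thrust::device_vector<int> whereFilter(const thrust::device_vector<int> &records) {
    const int n = static_cast<int>(records.size());
    thrust::device_vector<int> flags(n), positions(n);
    thrust::transform(records.begin(), records.end(), flags.begin(), Passes()); // map to 0/1
    thrust::exclusive_scan(flags.begin(), flags.end(), positions.begin());      // prefix sum
    int count = (n == 0) ? 0 : (int)positions[n - 1] + (int)flags[n - 1];
    thrust::device_vector<int> out(count);
    // Scatter each matching record to its compacted position.
    thrust::scatter_if(records.begin(), records.end(),
                       positions.begin(), flags.begin(), out.begin());
    return out;
}
```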
4) DNN Kernels:
All benchmarks in this section represent artificial neural network layers commonly seen in popular DNN models. All layers in this section include both forward and backward passes.

Activation layer is used to decide whether a neuron should be activated by calculating the weighted sum and adding a bias to it. It introduces non-linearity into the output of a neuron. Some of the most commonly used activation functions include ReLU, sigmoid, tanh, and LeakyReLU. Here we only present ReLU since it is the simplest one to understand. The following describes the ReLU activation function, where x_i is the input to the neuron and y_i is the output:

y_i = \max(0, x_i)  (1)

Pooling is commonly used between successive convolution layers in a ConvNet architecture. Its main function is to reduce the spatial dimensions of a convolutional neural network. For example, applying a max-pool kernel of size 2 × 2 with stride 2 halves each spatial dimension.

Batch normalization is a technique proposed to address covariate shift [20] in DNNs. When parameters in the preceding layer change, the input to the current layer will change accordingly, causing the current layer to adjust to the new distribution. The main goal of batch normalization is to limit the shifting to a certain range to speed up the training process and produce reliable models.
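As an illustration of how such layers are driven through cuDNN, the sketch below applies a forward ReLU activation over an NCHW tensor. The descriptors and sizes are illustrative; Mirovia's actual layer configurations may differ.

```cuda
#include <cudnn.h>

// Applies y = max(0, x) element-wise over an n x c x h x w tensor with cuDNN.
void reluForward(cudnnHandle_t handle, const float *d_x, float *d_y,
                 int n, int c, int h, int w) {
    cudnnTensorDescriptor_t desc;
    cudnnCreateTensorDescriptor(&desc);
    cudnnSetTensor4dDescriptor(desc, CUDNN_TENSOR_NCHW, CUDNN_DATA_FLOAT,
                               n, c, h, w);

    cudnnActivationDescriptor_t act;
    cudnnCreateActivationDescriptor(&act);
    cudnnSetActivationDescriptor(act, CUDNN_ACTIVATION_RELU,
                                 CUDNN_NOT_PROPAGATE_NAN, 0.0);

    const float alpha = 1.0f, beta = 0.0f;
    cudnnActivationForward(handle, act, &alpha, desc, d_x, &beta, desc, d_y);

    cudnnDestroyActivationDescriptor(act);
    cudnnDestroyTensorDescriptor(desc);
}
```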
Connected layers are those whose neurons are connected to every neuron in the next layer. The connected layer can be seen as a feature vector that holds aggregated information from the previous layer. For example, a connected layer can come right after a convolution layer which provides a low-dimensional, invariant feature space. The fully connected layer can then learn a function from that space to produce more useful or abstract knowledge.
Convolution layer is mostly used to extract important features from images by assigning learnable weights to various objects in those images. For example, given an RGB image of size 228 × 228 with 3 channels, we can train a convolution kernel of size 3 × 3 × 3 that produces an output of size 226 × 226 with 1 channel. The output tensor represents one feature in the image, such as the presence of curves in different parts of the input image.
Dropout is a regularization technique used to prevent neural networks from overfitting [21]. The key idea is to randomly drop units from the neural network during training. When training large neural networks on small data sets, overfitting can be a huge issue when the model is evaluated on the test data set. Dropout addresses this problem by stochastically introducing noise to prevent units from co-adapting too much, thus making the model more robust.
RNN stands for Recurrent Neural Network. It is widely adopted in learning tasks dealing with sequential data, such as speech recognition, text generation, and so on. RNNs have proven successful in capturing the dynamics of sequences by keeping internal states (memory) that track information from previous time steps. Among the most commonly used RNNs are GRU and LSTM. In our benchmark, we only show results for LSTM for simplicity.
Softmax layer is typically seen as the final output layer in a neural network performing multi-class classification. It takes an input, usually a score value z_i, i = 1 ... K, and recomputes it as probabilities. Therefore, the outputs of the layer represent a true probability distribution, where the individual outputs sum to 1. Its calculation is shown below:

\sigma(z_c) = \frac{e^{z_c}}{\sum_{k=1}^{K} e^{z_k}}  (2)

LRN (Local Response Normalization) is intended to simulate a form of lateral inhibition [22] inspired by the type found in real neurons. It damps responses that are uniformly large within a neighborhood and creates higher contrast in the activation map. This is especially useful with unbounded activation functions such as ReLU. The original formula is written as

b^i_{x,y} = a^i_{x,y} \Big/ \left( k + \alpha \sum_{j=\max(0,\, i-n/2)}^{\min(N-1,\, i+n/2)} \left(a^j_{x,y}\right)^2 \right)^{\beta}  (3)

where b^i_{x,y} is the regularized output for kernel i at position x, y; a^i_{x,y} is the source output of kernel i applied at position x, y; N is the number of kernels; n is the size of the normalization neighbourhood; and \alpha, \beta, k are hyperparameters of LRN.

V. EVALUATION
In this section, we evaluate the applications in Mirovia in terms of runtime characteristics, diversity, and performance. Our tests were performed on a machine with the following specifications:
• Ubuntu 18.04.2 LTS
• Linux 4.15.0-48-generic
• CPU: 2 × Intel(R) Xeon(R) CPU E5-2650 v4
  – 12 cores, 24 threads
  – 30M L3 cache
• GPU: NVIDIA Tesla P100-SXM2
  – Driver: 418.40.04

A. Benchmark Performance
We use the nvprof profiling tool to collect the metrics gathered from running individual kernels. Note that a number of benchmarks involve multiple kernels and some are redundant. We select the maximum utilization of each kernel and calculate the mean and corresponding standard deviation. The memory and computational unit utilization rates are shown in Figure 3 and Figure 4. The utilization of different GPU components shows a diverse set of behaviors for both forward and backward passes in the DNN section of the benchmark.
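For reference, the utilization values reported throughout this section correspond to nvprof's per-kernel utilization counters (reported on a 0-10 scale). A collection command of roughly the following form can be used; the exact metric set and options in Mirovia's scripts may differ, and metric names vary slightly across GPU generations.

```sh
nvprof --metrics dram_utilization,l2_utilization,shared_utilization,\
tex_utilization,cf_fu_utilization,ldst_fu_utilization,\
single_precision_fu_utilization,double_precision_fu_utilization,\
special_fu_utilization ./benchmark <benchmark args>
```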
TABLE II
SELECTED CUDNN KERNELS OF DNN BENCHMARKS

Benchmark   | Forward Kernel                        | Backward Kernel
Activation  | op generic tensor kernel              | op generic tensor kernel
Pooling     | pooing fw 4d kernel                   | pooing bw kernel avg
Batchnorm   | bn fw tr 1C11 kernel NCHW             | bn bw 1C11 kernel new
Connected   | maxwell sgemm 128x64 tn               | sgemm 128x128x8 TN vec
Convolution | maxwell scudnn 128x128 relu small nn  | wgrad alg0 engine
Dropout     | dropout fp                            | dropout bp
RNN         | maxwell sgemm 128x64 tn               | maxwell sgemm 128x64 nn
Softmax     | softmax fw kernel resident            | softmax bw kernel
LRN         | lrnForward evenC                      | lrnBackward evenC
Fig. 3. Forward Kernel Utilization
Fig. 4. Backward Kernel Utilization
Fig. 5. Mirovia Utilization

We observe that the most heavily utilized components are DRAM and the single-precision floating-point unit, with the backward average-pooling kernel being an exception, showing high utilization of shared memory and the load/store unit.

Table II shows the most relevant kernel of each DNN benchmark. Each kernel's collected metrics are presented in the figure below. The utilization of the single-precision floating-point unit is closely related to the IPC for most kernels. For example, both the forward and backward convolution kernels achieve high IPC, while the low utilization of the single-precision unit results in low IPC for the batch normalization kernel. The number of eligible warps per cycle is also high for convolution and low for batch normalization. This can be explained by the fact that convolution has relatively good data locality and spends less time waiting for data dependencies to be met, whereas batch normalization requires more memory operations, which reduces the number of warps eligible to issue the next instruction. This indicates that the convolution operation is compute bound while batch normalization is memory bound.

[Figure: per-layer Utilization, Eligible Warps Per Cycle, and Achieved Occupancy for the forward and backward DNN kernels.]
The utilization of the remaining Mirovia benchmarks shows a diverse range of values. Each GPU component's utilization is higher compared to SHOC. This can be explained by the increase in input data size, which demonstrates the importance of having user-defined input problem sizes to stress hardware performance. These benchmarks also differ from the DNN kernels: DNN kernels tend to stress DRAM and the single-precision units heavily, while the conventional benchmarks exhibit a more diverse utilization of each component.
B. CUDA Feature Analysis
In this section, we analyze the benchmarks which implement new CUDA features to find out how each feature affects application performance. To do this, we show the speedup of applications using the feature over the various preset problem sizes available in Mirovia.
Unified Memory: For this feature, the kernel time plus the transfer time of BFS without unified memory was compared to the kernel time of BFS with unified memory, since there is no explicit transfer time when using unified memory. Three different versions of BFS using unified memory were tested, and each was compared to a version of BFS that doesn't utilize any new features. The first version uses unified memory without cudaMemAdvise() or cudaMemPrefetchAsync(). The second version uses unified memory with only cudaMemAdvise(), and the last version uses unified memory with both cudaMemAdvise() and cudaMemPrefetchAsync(). We found that BFS with unified memory was able to run faster than the baseline version only when prefetching was introduced. Additionally, the speedup was fairly inconsistent and did not scale with the input size. This is because the execution path is highly dependent on the generated graph. Since data is randomly generated, this introduces randomness into the speedup over the various problem sizes. The result is reasonable because constant demand paging introduces execution overheads.

HyperQ: HyperQ was added to the level 1 Pathfinder benchmark. Since this CUDA feature increases utilization when multiple independent kernels can execute concurrently, we simply ran multiple instances of Pathfinder on different streams. The graph shows the speedup as the number of concurrent Pathfinder kernels increases. The transfer time is not included because it would stay the same regardless. Our results show that the speedup gained from HyperQ increases as the number of parallel kernels scales up. The speedup levels out around 32 parallel instances, which makes sense, as at this point the benchmark is saturating all 32 independent work queues. In addition, we see speedup starting at a little under 1x for a single instance, and up to 4x thereafter. This follows as increasing the number of instances makes use of more of the work queues that would otherwise sit idle.
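A condensed sketch of the three unified-memory variants compared above is shown below; the allocation size and pointer names are illustrative, and the BFS kernel launch is only indicated by a comment.

```cuda
#include <cuda_runtime.h>

int main() {
    const size_t bytes = 1 << 28;   // placeholder graph size
    const int device = 0;
    int *edges = nullptr;

    // Variant 1: plain managed memory; pages migrate on demand (page faults).
    cudaMallocManaged(&edges, bytes);
    // ... build the graph in `edges` on the host ...

    // Variant 2: additionally tell the driver where the data should live.
    cudaMemAdvise(edges, bytes, cudaMemAdviseSetPreferredLocation, device);

    // Variant 3: additionally prefetch the data to the GPU before the launch,
    // avoiding demand paging during the traversal itself.
    cudaMemPrefetchAsync(edges, bytes, device);

    // bfsKernel<<<grid, block>>>(edges, ...);   // hypothetical kernel launch
    cudaDeviceSynchronize();
    cudaFree(edges);
    return 0;
}
```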
Cooperative Groups: Similar to HyperQ, the kernel time for SRAD using a cooperative kernel was compared to the kernel time of the original SRAD implementation. In this case, the transfer time was not included because it would have been the same for both. The biggest drawback of using cooperative groups is the limit on the number of blocks that can be launched. Because of this, SRAD with a cooperative kernel could not be run on image sizes greater than 256x256. Therefore, to get more data points, we varied the problem size by multiples of 16 instead of powers of 2.
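A minimal sketch of a grid-wide synchronization between two phases of a kernel is shown below (not the SRAD code). Cooperative launches require a device and driver that support them, and, as noted above, the total number of blocks must be able to reside on the device simultaneously.

```cuda
#include <cuda_runtime.h>
#include <cooperative_groups.h>
namespace cg = cooperative_groups;

__global__ void twoPhaseKernel(float *in, float *tmp, float *out, int n) {
    cg::grid_group grid = cg::this_grid();
    int i = (int)grid.thread_rank();

    // Phase 1: every thread writes an intermediate value.
    if (i < n) tmp[i] = in[i] * in[i];

    // All blocks must finish phase 1 before any thread starts phase 2.
    grid.sync();

    // Phase 2: safely reads phase-1 results produced by other blocks.
    if (i < n) out[i] = tmp[i] + tmp[n - 1 - i];
}

void launch(float *d_in, float *d_tmp, float *d_out, int n) {
    void *args[] = { &d_in, &d_tmp, &d_out, &n };
    dim3 block(256), grid((n + 255) / 256);
    // Cooperative kernels must be launched through this runtime API call.
    cudaLaunchCooperativeKernel((void *)twoPhaseKernel, grid, block, args, 0, 0);
}
```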
Dynamic Parallelism: For this feature, the speedup was measured using the kernel times for Mandelbrot with and without Dynamic Parallelism. Like most of the other features, transfer time was not included in the speedup because it was the same for both versions of the benchmark. This benchmark shows one of the cleanest increases in speedup as problem sizes increase. This primarily comes down to the efficiency of the two algorithms used and what Dynamic Parallelism allows. While the traditional Escape Time algorithm is forced to calculate values for every pixel, Mariani-Silver is allowed to subdivide and thus ignore ever increasing swaths of the image. This is shown by the increasing speedup as image size increases. While this benchmark's use of dynamic parallelism exercises regular parallelism, the feature can also be used for more varied implementations. The Mandelbrot kernel explicitly calls itself over and over on smaller patches of the image; other programs may choose to have a master kernel that calls multiple diverse sub-kernels. Mandelbrot is still enough to stress Dynamic Parallelism and is thus what we chose to include in Mirovia.
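The following is a schematic sketch of device-side kernel launches in the spirit of the refinement strategy described above; it is not the Mandelbrot kernel itself, and the kernel names and work-detection logic are placeholders. It must be compiled with relocatable device code (nvcc -rdc=true) on compute capability 3.5 or newer.

```cuda
#include <cuda_runtime.h>

// Child kernel: processes one segment of the array.
__global__ void childKernel(float *data, int offset, int len) {
    int i = offset + blockIdx.x * blockDim.x + threadIdx.x;
    if (i < offset + len) data[i] = sqrtf(data[i]);
}

// Parent kernel: each thread inspects its segment and only launches a child
// grid when there is actual work, analogous to how Mariani-Silver refinement
// skips regions that do not need per-pixel computation.
__global__ void parentKernel(float *data, const int *workPerSegment,
                             int segLen, int numSegments) {
    int s = blockIdx.x * blockDim.x + threadIdx.x;
    if (s >= numSegments) return;
    if (workPerSegment[s] > 0) {
        childKernel<<<(segLen + 127) / 128, 128>>>(data, s * segLen, segLen);
    }
}
```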
VI. CONCLUSION AND FUTURE WORK
In creating Mirovia we aimed to modernize aspects of popular existing suites such as Rodinia and SHOC. To do this, we improved the spread of benchmarks included, bringing in new programs from different domains, while also adapting problem sizes to the abilities of modern hardware. Most importantly, we added support for measuring the performance of many new features that were introduced in recent years. Features like Cooperative Groups, HyperQ, and half-precision arithmetic are all new enough that no current suite tests them at all. To capture the characteristics of DNN behavior on GPUs, we included a set of popular neural network kernels. In this way we present Mirovia as a more complete benchmark suite for the modern era. Plans for future work include:
• Add benchmark support for GPUDirect RDMA. This is a feature that allows direct data exchange between devices on a PCI bus. Utilizing this greatly reduces the time required for data transfers by bypassing typical memory copies across various data planes. This is especially useful for pipelined data processing workloads where there are multiple disjoint transformations acting on the same data. GPUDirect RDMA allows data to stay off the host the entire time, moving directly from stage to stage of the workload.
• Explore our benchmark diversity analysis by using the Principal Component Analysis (PCA) and Hierarchical Clustering Analysis described in [23].
• Incorporate new CUDA features such as CUDA graphs into our benchmarks to facilitate program speedup.
• Add new benchmarks to test new hardware features such as tensor cores [24] in more recent architectures. Tensor Cores are specialized hardware units designed for performing mixed precision matrix computations commonly used in deep learning neural network training and inference applications.
REFERENCES
[1] S. Che, M. Boyer, J. Meng, D. Tarjan, J. W. Sheaffer, S. Lee, and K. Skadron, "Rodinia: A benchmark suite for heterogeneous computing," pp. 44–54, Oct 2009.
[2] A. Danalis, G. Marin, C. McCurdy, J. S. Meredith, P. C. Roth, K. Spafford, V. Tipparaju, and J. S. Vetter, "The scalable heterogeneous computing (SHOC) benchmark suite," in Proceedings of the 3rd Workshop on General-Purpose Computation on Graphics Processing Units, GPGPU-3, (New York, NY, USA), pp. 63–74, ACM, 2010.
[3] M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado, A. Davis, J. Dean, M. Devin, S. Ghemawat, I. Goodfellow, A. Harp, G. Irving, M. Isard, Y. Jia, R. Jozefowicz, L. Kaiser, M. Kudlur, J. Levenberg, D. Mané, R. Monga, S. Moore, D. Murray, C. Olah, M. Schuster, J. Shlens, B. Steiner, I. Sutskever, K. Talwar, P. Tucker, V. Vanhoucke, V. Vasudevan, F. Viégas, O. Vinyals, P. Warden, M. Wattenberg, M. Wicke, Y. Yu, and X. Zheng, "TensorFlow: Large-scale machine learning on heterogeneous systems," 2015. Software available from tensorflow.org.
[4] A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer, "Automatic differentiation in PyTorch," NIPS-W, 2017.
[5] K. Asanović, R. Bodik, B. C. Catanzaro, J. J. Gebis, P. Husbands, K. Keutzer, D. A. Patterson, W. L. Plishker, J. Shalf, S. W. Williams, and K. A. Yelick, "The landscape of parallel computing research: A view from Berkeley," Tech. Rep. UCB/EECS-2006-183, EECS Department, University of California, Berkeley, Dec 2006.
[6] M. Kulkarni, M. Burtscher, C. Cascaval, and K. Pingali, "Lonestar: A suite of parallel irregular programs," (Los Alamitos, CA, USA), IEEE Computer Society, Apr 2009.
[7] S. Che, B. M. Beckmann, S. K. Reinhardt, and K. Skadron, "Pannotia: Understanding irregular GPGPU graph applications," pp. 185–195, Sep. 2013.
[8] M. A. O'Neil and M. Burtscher, "Microarchitectural performance characterization of irregular GPU kernels," pp. 130–139, Oct 2014.
[9] M. D. Sinclair, J. Alsop, and S. V. Adve, "HeteroSync: A benchmark suite for fine-grained synchronization on tightly coupled GPUs," pp. 239–249, Oct 2017.
[10] A. Jog, O. Kayiran, T. Kesten, A. Pattnaik, E. Bolotin, N. Chatterjee, S. W. Keckler, M. T. Kandemir, and C. R. Das, "Anatomy of GPU memory system for multi-application execution," in Proceedings of the 2015 International Symposium on Memory Systems, MEMSYS '15, (New York, NY, USA), pp. 223–234, ACM, 2015.
[11] H. Zhu, M. Akrout, B. Zheng, A. Pelegris, A. Jayarajan, A. Phanishayee, B. Schroeder, and G. Pekhimenko, "Benchmarking and analyzing deep neural network training," pp. 88–100, Sep. 2018.
[12] J. Lew, D. A. Shah, S. Pati, S. Cattell, M. Zhang, A. Sandhupatla, C. Ng, N. Goli, M. D. Sinclair, T. G. Rogers, and T. M. Aamodt, "Analyzing machine learning workloads using a detailed GPU simulator," pp. 151–152, March 2019.
[13] S. A. Mojumder, M. S. Louis, Y. Sun, A. K. Ziabari, J. L. Abellán, J. Kim, D. Kaeli, and A. Joshi, "Profiling DNN workloads on a Volta-based DGX-1 system," pp. 122–133, Sep. 2018.
[14] A. Karki, C. P. Keshava, S. M. Shivakumar, J. Skow, G. M. Hegde, and H. Jeon, "Tango: A deep neural network benchmark suite for various accelerators," Jan 2019.
[15] S. Dong and D. Kaeli, "DNNMark: A deep neural network benchmark suite for GPUs," in Proceedings of the General Purpose GPUs, GPGPU-10, (New York, NY, USA), pp. 63–72, ACM, 2017.
[16] N. P. Jouppi, C. Young, N. Patil, D. Patterson, G. Agrawal, R. Bajwa, S. Bates, S. Bhatia, N. Boden, A. Borchers, R. Boyle, P. Cantin, C. Chao, C. Clark, J. Coriell, M. Daley, M. Dau, J. Dean, B. Gelb, T. V. Ghaemmaghami, R. Gottipati, W. Gulland, R. Hagmann, C. R. Ho, D. Hogberg, J. Hu, R. Hundt, D. Hurt, J. Ibarz, A. Jaffey, A. Jaworski, A. Kaplan, H. Khaitan, D. Killebrew, A. Koch, N. Kumar, S. Lacy, J. Laudon, J. Law, D. Le, C. Leary, Z. Liu, K. Lucke, A. Lundin, G. MacKean, A. Maggiore, M. Mahony, K. Miller, R. Nagarajan, R. Narayanaswami, R. Ni, K. Nix, T. Norrie, M. Omernick, N. Penukonda, A. Phelps, J. Ross, M. Ross, A. Salek, E. Samadiani, C. Severn, G. Sizikov, M. Snelham, J. Souter, D. Steinberg, A. Swing, M. Tan, G. Thorson, B. Tian, H. Toma, E. Tuttle, V. Vasudevan, R. Walter, W. Wang, E. Wilcox, and D. H. Yoon, "In-datacenter performance analysis of a tensor processing unit," pp. 1–12, June 2017.
[17] B. Pham, V. Vaidyanathan, A. Jaleel, and A. Bhattacharjee, "CoLT: Coalesced large-reach TLBs," pp. 258–269, Dec 2012.
[18] J. Redmon, "Darknet: Open source neural networks in C." http://pjreddie.com/darknet/, 2013–2016.
[19] N. Satish, M. Harris, and M. Garland, "Designing efficient sorting algorithms for manycore GPUs," pp. 1–10, May 2009.
[20] S. Ioffe and C. Szegedy, "Batch normalization: Accelerating deep network training by reducing internal covariate shift," in ICML, 2015.
[21] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, "Dropout: A simple way to prevent neural networks from overfitting," Journal of Machine Learning Research, vol. 15, pp. 1929–1958, 2014.
[22] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "ImageNet classification with deep convolutional neural networks," in Advances in Neural Information Processing Systems 25 (F. Pereira, C. J. C. Burges, L. Bottou, and K. Q. Weinberger, eds.), pp. 1097–1105, Curran Associates, Inc., 2012.
[23] N. Goswami, R. Shankar, M. Joshi, and T. Li, "Exploring GPGPU workloads: Characterization methodology, analysis and microarchitecture evaluation implications," in IEEE International Symposium on Workload Characterization (IISWC'10), pp. 1–10, Dec 2010.
[24] S. Markidis, S. Wei Der Chien, E. Laure, I. Peng, and J. S. Vetter, "NVIDIA tensor core programmability, performance & precision," Mar 2018.