Mirovia: A Benchmarking Suite for Modern Heterogeneous Computing
Bodun Hu
University of Texas at Austin
Austin, TX
Christopher J. Rossbach
University of Texas at Austin
Austin, TX
Abstract—This paper presents Mirovia, a benchmark suite developed for modern-day heterogeneous computing. Previous benchmark suites such as Rodinia [1] and SHOC [2] are well written and have many desirable features. However, these tools were developed years ago when hardware was less powerful and software had fewer features. For example, unified memory was introduced in CUDA 6 as a new programming model and wasn't available when Rodinia was released. Meanwhile, the increasing demand for graphics processing units (GPUs) due to the recent rise in popularity of deep neural networks (DNNs) has opened doors for many new research problems. It is essential to consider DNNs as first-class citizens in a comprehensive benchmark suite. However, the main focus of existing DNN frameworks is usually limited to inference and model performance evaluation, which is not desirable for hardware architects studying emerging platforms. Drawing inspiration from Rodinia and SHOC, Mirovia is a benchmark suite that is designed to take advantage of modern GPU architectures, while also representing a diverse set of application domains. By adopting applications from Rodinia and SHOC, and including newly written applications with a special focus on DNNs, Mirovia better characterizes modern heterogeneous systems.
I. INTRODUCTION
GPUs have become popular for accelerating computing in industry in recent years. Traditionally, these computational devices were used primarily for 3D rendering. However, their use as General-Purpose Graphics Processing Units (GPGPUs) has expanded rapidly since then. Heterogeneous computing, using both CPUs and GPUs, is becoming more and more powerful, and this trend is likely to continue. These systems are also becoming increasingly easier to program due to the development of languages such as CUDA. Because of the high demand for heterogeneous systems, these tools are seeing a rise in popularity in the programming community.

Existing benchmark suites such as Rodinia and SHOC were designed to better understand the characteristics of heterogeneous systems. A set of applications was curated so that each benchmark exhibits unique behaviors that stress a characteristic or component of GPUs. Doing so allows programmers to select the most suitable hardware and software for specific tasks. However, these benchmark suites haven't kept up with the evolution of programming frameworks: they don't utilize newer features, like HyperQ, that were introduced in recent versions of CUDA. Further, while hardware has evolved to provide more raw compute, the task of scaling the input to benchmarks has either been left to the user, as is the case with Rodinia, or is entirely untenable, as in SHOC. This makes the applications unlikely to stay relevant as problem sizes grow larger.

GPUs are the de-facto platform for training deep neural networks. Frameworks like Tensorflow [3] and Pytorch [4] are widely adopted in the neural network community. However, they are not designed to study hardware behaviors and usually introduce excessive memory footprints, making them less desirable for hardware architects. SHOC does include a neuralnet benchmark, but it fails to capture the complexity of recent DNN models.

This paper makes several key contributions:
• Present the areas in which the existing benchmark suites Rodinia and SHOC are lacking, specifically in workload diversity, problem sizes, and programming features.
• Demonstrate the improvements made in Mirovia and how these improvements characterize the capabilities of modern heterogeneous systems more comprehensively. Mirovia adds new features supported up to CUDA 9.0.
• Understand the characteristics of neural networks as a new application domain and compare them to conventional GPU applications by adding commonly used DNN kernels to Mirovia.
II. RELATED WORK
A. Rodinia
Rodinia is a suite of applications designed for heterogeneous systems, released in 2009. It consists of applications and kernels representing different types of behavior of the Berkeley dwarfs [5]. The dwarfs represent 13 major categories of computation used to describe most types of problems. In addition, Rodinia covers a diverse set of applications covering communication, synchronization, and power consumption. However, Rodinia doesn't utilize new features like unified memory, HyperQ, Cooperative Groups, and Dynamic Parallelism. The benchmark suite still targets CUDA 4.0. Therefore, it will not utilize programming constructs or performance improvements introduced in newer CUDA versions, such as the CUDA event library. Moreover, Rodinia doesn't support hardware capable of performing half-precision operations and tensor core computation.

To analyze how Rodinia stresses characteristics of GPUs, performance was profiled for each benchmark using nvprof. These metrics quantify the utilization rate on a scale of 0 to 10. The utilization rate for each component indicates how much time was spent on that component relative to the total execution time: 0 means idle, while 10 means full utilization. Because many applications run multiple kernels, the maximum utilization rate from any component is selected for each kernel in the benchmark.

Figure 1 shows the average utilization rate and its standard variance for the different functional units and the memory hierarchy for each application in Rodinia. Since Rodinia does not come with default run parameters but rather a sample run command, the parameters in the sample command were used for each application. These sample files can be found in the directory of each benchmark.

One observation is that the utilization of many components is relatively small even though many applications in Rodinia consist of multiple kernels. For example, the average GPU component utilization looks very similar for benchmarks such as gaussian, huffman, nw, and myocyte. Many hardware components are also not fully stressed to achieve maximum utilization. This is because Rodinia consists of higher-level applications, instead of microbenchmarks targeting specific components as SHOC does. The majority of kernels in each benchmark are not designed to target a specific GPU component.

B. SHOC
Developed in 2010, SHOC is a suite of benchmarks for heterogeneous computing platforms. Unlike Rodinia, SHOC is divided into two primary categories: stress tests and performance tests. The stress tests use computationally demanding kernels to identify devices with bad memory, insufficient cooling, and other device component related issues. Performance tests, on the other hand, concentrate on synthetic kernels and common parallel algorithms. All SHOC applications run within a unified framework which accepts user-specified testing parameters such as the number of iterations to run. Detailed metrics, including floating point operations per second (FLOPS), can also be recorded. These features are very useful for evaluating performance. However, many programs in SHOC are just basic parallel algorithms, which may only be a subset of the routines used in more common and more complex applications. Even though SHOC covers a variety of dwarfs, it doesn't represent the complexity present in real-world applications. Similar to Rodinia, SHOC was written at a time when newer software and hardware features weren't available.

Figure 2 shows the GPU resource utilization and standard variance for each application in SHOC. All metrics are collected using the largest preset data size available. Unlike Rodinia, the utilization rate no longer exhibits a fixed pattern but varies over a diverse range. This is because SHOC consists of microbenchmarks that target specific hardware components. However, most components are not fully stressed to maximum capacity.
C. Other Benchmarks
More recent benchmarks [6] [7] [8] have addressed irregular parallelism, rather than stressing overall heterogeneous performance. Further, not all new CUDA features have been taken into consideration. Sinclair, Alsop, and Adve [9] focused on benchmarking the effects of different levels of synchronization (from coarse to fine). Their work was more focused on a benchmark suite that evaluates synchronization through data sharing (e.g. unified memory and coherent caches). The MAFIA framework [10] was designed to target multi-application execution on GPUs.

Due to the rapid growth in popularity of machine learning, there has been significant focus on characterizing DNN behavior on GPUs [11]–[13]. Popular frameworks such as Tensorflow include primitive tools for users to analyze the computational demands of their models. Tango [14] is a framework to study the behaviors of specific neural network models. DNNMark [15] is a framework to study various kernels used in DNNs.
III. MOTIVATION
The previous section introduced Rodinia and SHOC, analyzing the pros and cons of each. Here we motivate the creation of Mirovia and outline the key areas in which existing benchmark suites can be improved.
A. GPU Application Domain
GPUs have come to be used in many new domains in recent years that are not represented by workloads in Rodinia or SHOC, e.g., data analytics. Relational algebra and SQL statements are not represented in either Rodinia or SHOC. Similarly, while GPUs have become the standard for DNN model training, these benchmark suites do not include neural network based kernels. Even though GPUs were designed to maximize throughput, the recent introduction of specialized hardware like TPUs [16] suggests that there may be potential for improvement in GPUs. It is necessary to study the behavior of individual kernels to find potential performance improvements.
B. Better Dataset Sizes
One of the most obvious aspects of existing benchmark suites to be updated is dataset sizes. In SHOC, there are 4 preset data sizes. This lack of flexibility makes it hard for SHOC to stay relevant in the future, as advancing technology will eventually cause even the largest data size to be too small to stress GPU resources. Rodinia has the exact opposite problem, where benchmarks have no preset data size and the user must specify their own problem sizes. Users have to run data generation scripts even if they do not know what input size may be appropriate for the system they are benchmarking. Benchmarking with improperly sized input data throws the validity of the benchmarks into question, particularly when it is quite well established that memory coalescing can be used to great effect on GPUs [17].
Fig. 1. GPU resource utilization by Rodinia benchmarks. 0 indicates no utilization; 10 indicates full utilization.
Fig. 2. GPU resource utilization by SHOC benchmarks. 0 indicates no utilization; 10 indicates full utilization.
C. Support for Recent CUDA Versions
In addition to general performance improvements, each new version of CUDA typically introduces new programming constructs that can be used to write more efficient code. The official Rodinia benchmark suite relies only on features that were available in CUDA 4.0, and thus doesn't take advantage of the newer CUDA constructs. It is essential to include these new features to understand their impact on performance.

IV. THE MIROVIA BENCHMARK SUITE
In Mirovia, like SHOC, benchmarks are divided into levels. The levels range from benchmarks characterizing low-level hardware characteristics, such as memory bandwidth, up to real-world application performance. While determining the set of benchmarks for Mirovia, consideration was given to both the Berkeley dwarfs and application domains. Table I shows the benchmarks included in Mirovia and their respective primitives and application domains.

TABLE I
MIROVIA BENCHMARKS, THEIR RESPECTIVE PRIMITIVES AND APPLICATION DOMAINS
(Columns: Level, Benchmark, Dwarf, Application Domain, New CUDA Feature)

The Mirovia suite contains the following features:
• A new set of benchmarks representing neural network layers commonly used in popular DNN models. This section consists of 15 types of layers and 1 real-world DNN model. They have been parallelized with CUDA APIs and powered by libraries including cuBLAS and cuDNN (NVIDIA CUDA Deep Neural Network library). We decided to include a neural network model because individual layers are not sufficient in terms of representing real-world application workflows. We use Darknet [18], an open-source neural network framework, to construct neural networks. However, most of the kernels in Darknet don't utilize the cuDNN library, which causes a degradation in performance compared to industrial standards like Tensorflow. Thus, we reimplemented the most commonly used kernels with the cuDNN library and removed extra memory operations to reduce memory footprints. We believe these neural network applications will enrich the benchmark diversity in Mirovia.
• Mirovia aims to strike a balance between the predetermined input sizes available in SHOC and the customizable input sizes available in Rodinia. Benchmarks contain preset sizes optimized for systems with different compute capabilities, as well as a mechanism through which users can specify the size and other aspects of their input. This feature merges the favorable qualities of both Rodinia and SHOC.
• Several benchmarks have been updated to utilize the most recent release of CUDA. For each feature, one benchmark was chosen to test it. These features include:
  – HyperQ: allows multiple independent CUDA kernels to execute in parallel on the same GPU if the resources are available. HyperQ uses 32 independent Work Distributor Queues to detect opportunities for parallelism, whereas older architectures use a single Work Distributor Queue. This feature is implemented in Pathfinder.
  – Dynamic Parallelism: enables currently executing CUDA kernels to call child CUDA kernels (nested parallelism). This feature is useful when running algorithms with hierarchical data structures and recursive algorithms with parallelism at each level. This feature is added to Mandelbrot.
  – Unified Memory: a programming construct that gives the programmer the illusion that the host and the device share an address space. It establishes a single memory address space visible to all processors in the system. When an application accesses data currently absent from the running device, the hardware automatically pages in the data needed by the processor. This feature is implemented in BFS.
  – Cooperative Groups (Grid Sync): provides another granularity of synchronization for kernel threads running on a GPU. Grid sync allows users to synchronize all threads in the entire grid before beginning the next section of computation. This is useful for programs with disjoint phases of computation running right after one another, and it takes synchronization granularity to the next level, as previous CUDA versions only support __syncthreads(), which synchronizes all threads in a single block. This feature is implemented in SRAD.
  – CUDA Event API: allows accurate timing of CUDA functions and kernel calls (a minimal timing sketch is shown after this list). This is an improvement over Rodinia, which still uses system time.
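As an illustration, the following is a minimal sketch of timing a kernel with the CUDA Event API. The kernel and sizes here are placeholders for exposition, not Mirovia's actual code.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel used only to have something to time.
__global__ void dummyKernel(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = data[i] * 2.0f + 1.0f;
}

int main() {
    const int n = 1 << 20;
    float *d_data;
    cudaMalloc(&d_data, n * sizeof(float));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    // Events are recorded on the same stream as the kernel, so the elapsed
    // time covers exactly the device-side execution, unlike host timers.
    cudaEventRecord(start);
    dummyKernel<<<(n + 255) / 256, 256>>>(d_data, n);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("kernel time: %.3f ms\n", ms);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_data);
    return 0;
}
```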
A. Workloads
1) Level 0: Level 0 benchmarks are designed to measure low-level characteristics of the hardware. These benchmarks perform the simple task of measuring a single capability of the GPU and therefore don't represent any dwarfs or application domains.
BusSpeedDownload measures the speed of the PCI bus by repeatedly transferring data of various sizes from the host to the device. The data sizes vary from 1 KB to 500 KB.
BusSpeedReadback measures the speed of the PCI bus in the opposite direction: data is transferred from the device to the host.
DeviceMemory measures the bandwidth of different components of the memory hierarchy on the device, including global, constant, and shared memory.
MaxFlops (Half Precision) measures the maximum achievable floating point operations per second on the device. In SHOC, this benchmark runs tests using single and double precision. The half-precision test will only work on architectures that support half-precision floating point arithmetic.
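A simplified sketch of the kind of half-precision FMA loop such a benchmark might time is shown below. This is not Mirovia's actual kernel; it assumes a GPU with compute capability 5.3 or higher and the cuda_fp16.h intrinsics.

```cuda
#include <cuda_fp16.h>

// Each thread repeatedly issues packed half2 fused multiply-adds, so the
// measured time is dominated by arithmetic throughput, not memory traffic.
__global__ void maxFlopsHalf(__half2 *out, int iters) {
    __half2 a = __floats2half2_rn(1.001f, 0.999f);
    __half2 b = __floats2half2_rn(0.5f, 0.5f);
    __half2 c = __floats2half2_rn(0.0f, 0.0f);
    for (int i = 0; i < iters; ++i) {
        c = __hfma2(a, c, b);   // a*c + b: 2 FLOPs per lane, 4 per instruction
    }
    // Write the result so the compiler cannot eliminate the loop.
    out[blockIdx.x * blockDim.x + threadIdx.x] = c;
}
```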
2) Level 1:
Level 1 benchmarks include basic parallel algorithms which are common tasks in parallel computing and often used in kernels of real applications. While these applications represent a subset of the Berkeley dwarfs, they are complex enough to represent real application domains.
GUPS (Random Memory Access) stands for giga-updates per second. It measures how frequently a computer can issue updates to randomly generated RAM locations. This benchmark stresses the latency and bandwidth of the device. The test is important because random memory performance directly maps to application performance.
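A minimal sketch of the style of random-update kernel GUPS measures is shown below; the index generator here is a simple LCG chosen for illustration, and the real benchmark's update rule may differ.

```cuda
// Each thread performs XOR updates at pseudo-random locations in a large
// table, stressing memory latency rather than streaming bandwidth.
__global__ void randomUpdates(unsigned long long *table, size_t tableSize,
                              int updatesPerThread) {
    unsigned long long state = blockIdx.x * blockDim.x + threadIdx.x + 1ULL;
    for (int i = 0; i < updatesPerThread; ++i) {
        state = state * 6364136223846793005ULL + 1442695040888963407ULL;  // LCG step
        size_t idx = state % tableSize;
        table[idx] ^= state;   // data-dependent, uncoalesced access
    }
}
```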
Breadth First Search (Unified Memory) runs and measures the performance of breadth-first search, a common graph traversal algorithm. This application was included because it is control-flow intensive. This benchmark was also chosen to test the unified memory feature in CUDA.
General Matrix Multiply is an application that measures the performance of different types of matrix multiplications. The types include single and double precision tests, with and without transposing the input matrices.
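The benchmark relies on BLAS-style matrix multiply; a minimal cuBLAS call for the single-precision, non-transposed case looks roughly like the sketch below. The function name, dimensions, and device pointers are placeholders, not Mirovia's code.

```cuda
#include <cublas_v2.h>

// Computes C = alpha * A * B + beta * C for column-major inputs,
// where A is m x k, B is k x n, and C is m x n.
void sgemmExample(cublasHandle_t handle, const float *dA, const float *dB,
                  float *dC, int m, int n, int k) {
    const float alpha = 1.0f, beta = 0.0f;
    cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                m, n, k,
                &alpha, dA, m,   // lda = m
                dB, k,           // ldb = k
                &beta, dC, m);   // ldc = m
}
```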
Pathfinder (HyperQ) is an application that runs a shortest-path algorithm which serves as a test of irregular parallelism. While most conventional parallel algorithms have uniform behavior across the different threads, irregular algorithms are characterized by different threads performing different executions. Depending on graph connectivity, different threads can experience unique behaviors. In addition, Pathfinder experiences much higher control flow unit utilization compared to regular parallel algorithms, as each thread needs to decide how to execute independently. Therefore, we decided to exercise HyperQ in this test.
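A minimal sketch of how independent kernel instances can be placed on separate streams so HyperQ can overlap them is shown below; the dummy kernel and buffer setup are placeholders rather than the Pathfinder code.

```cuda
#include <cuda_runtime.h>

__global__ void independentWork(float *buf, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) buf[i] = sqrtf(buf[i]) + 1.0f;
}

int main() {
    const int numStreams = 32, n = 1 << 20;
    cudaStream_t streams[numStreams];
    float *buffers[numStreams];
    for (int i = 0; i < numStreams; ++i) {
        cudaStreamCreate(&streams[i]);
        cudaMalloc(&buffers[i], n * sizeof(float));
    }
    // With HyperQ, independent launches on distinct streams can be scheduled
    // concurrently as long as SM resources are available.
    for (int i = 0; i < numStreams; ++i)
        independentWork<<<(n + 255) / 256, 256, 0, streams[i]>>>(buffers[i], n);
    for (int i = 0; i < numStreams; ++i) {
        cudaStreamSynchronize(streams[i]);
        cudaFree(buffers[i]);
        cudaStreamDestroy(streams[i]);
    }
    return 0;
}
```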
Sort is an application that runs a fast radix sort [19] on an array of integers. It operates on key-value pairs of single precision floating point data.
3) Level 2:
Level 2 benchmarks are real-world application kernels. Benchmarks in this level are applications that can be found in industry, and therefore represent a variety of GPU application domains. These applications exhibit a diverse set of performance characteristics.
CFD Solver is a computational fluid dynamics benchmark. This application solves the three-dimensional Euler equations for compressible flow. The workload optimizes effective GPU memory bandwidth by reducing total global memory accesses and overlapping computation.
GPUDWT implements the discrete wavelet transform, an image and video compression algorithm that is also a popularly used digital signal processing technique. This benchmark implements both forward and reverse transforms, as well as the 9/7 and 5/3 variants. The 9/7 transform uses floats while the 5/3 transform uses integers, so it is important to measure the performance of both.
KMeans is a popular clustering algorithm used in data mining that shows a high degree of data parallelism. At the beginning, K centers are chosen. In each iteration, each data point is assigned to its nearest center, and at the end of the iteration each center is recomputed as the mean of all the data points in its cluster; this repeats until the assignments converge.
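A simplified sketch of the per-point assignment step that dominates the data-parallel work is shown below, using 1-D points for brevity; the actual benchmark operates on multi-dimensional features.

```cuda
#include <cfloat>

// Assign each point to its nearest center (1-D features for brevity).
__global__ void assignClusters(const float *points, const float *centers,
                               int *assignments, int numPoints, int k) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= numPoints) return;
    float best = FLT_MAX;
    int bestCenter = 0;
    for (int c = 0; c < k; ++c) {
        float d = points[i] - centers[c];
        d *= d;
        if (d < best) { best = d; bestCenter = c; }
    }
    assignments[i] = bestCenter;
}
```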
LavaMD calculates N-body particle interaction. The code calculates particle potential and relocation due to mutual forces between particles within a large 3D space. This space is divided into cubes, or large boxes, that are allocated to individual cluster nodes. The large box at each node is further divided into cubes, called boxes. 26 neighbor boxes surround each box (the home box). Home boxes at the boundaries of the particle space have fewer neighbors. Particles only interact with other particles that are within a cutoff radius, since ones at larger distances exert negligible forces. Thus the box size is chosen so that the cutoff radius does not span beyond any neighbor box for any particle in a home box, limiting the reference space to a finite number of boxes.
Mandelbrot (Dynamic Parallelism) computes an image of a Mandelbrot fractal, a self-repeating geometric pattern that loops back on itself at ever decreasing sizes. A commonly used algorithm is the Escape Time algorithm, which calculates the value for different pixels on a per-pixel basis. This benchmark was added specifically to test Dynamic Parallelism, a feature added to CUDA in version 5.0. With Dynamic Parallelism, the benchmark switches to using the Mariani-Silver algorithm. Unlike Escape Time, this procedure starts out coarse grained and only iterates at a finer resolution if necessary for certain subsections.
Needleman-Wunsch is a nonlinear global optimization method for DNA sequence alignments. The potential pairs of sequences are organized in a 2D matrix. In the first step, the algorithm fills the matrix from top left to bottom right, step by step. The optimum alignment is the pathway through the array with maximum score, where the score is the value of the maximum weighted path ending at that cell. Thus, the value of each data element depends on the values of its northwest-, north-, and west-adjacent elements. In the second step, the maximum path is traced backward to deduce the optimal alignment.
ParticleFilter is a statistical estimator of the location of a target object given noisy measurements of that target's location and an idea of the object's path in a Bayesian framework. The particle filter has a plethora of applications ranging from video surveillance, in the form of tracking vehicles, cells, and faces, to video compression. This particular implementation is optimized for tracking cells, particularly leukocytes and myocardial cells.
SRAD (Cooperative Groups) is a computer vision application used for reducing noise, or speckles, in images without destroying important image features. This is done using partial differential equations. Since each stage of this application operates on the entire image, SRAD requires synchronization after each stage. This makes SRAD the ideal benchmark to test the performance of using cooperative groups in CUDA.
Where is a new relational algebra benchmark. GPUs are becoming increasingly popular for data analytics because relational algebra operations are easy to parallelize. This benchmark acts like a filter for a set of records, returning the subset of the input records that meet a set of conditions. It first maps each entry to a 1 or 0, before running a prefix sum and using both of these auxiliary data structures to reduce the input data to just the matching entries.
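A Thrust-based sketch of the map / prefix-sum / scatter pattern described above is shown below. The predicate, record type, and function name are placeholders, not the benchmark's actual implementation.

```cuda
#include <thrust/device_vector.h>
#include <thrust/transform.h>
#include <thrust/scan.h>
#include <thrust/scatter.h>

struct Passes {                     // placeholder predicate: keep records > 50
    __host__ __device__ int operator()(int rec) const { return rec > 50 ? 1 : 0; }
};

// Returns the matching records, compacted, following the map -> prefix sum ->
// scatter pattern used by the Where benchmark.
thrust::device_vector<int> whereFilter(const thrust::device_vector<int> &records) {
    const int n = static_cast<int>(records.size());
    thrust::device_vector<int> flags(n), positions(n);
    thrust::transform(records.begin(), records.end(), flags.begin(), Passes()); // map to 0/1
    thrust::exclusive_scan(flags.begin(), flags.end(), positions.begin());      // prefix sum
    int count = (n == 0) ? 0 : (int)positions[n - 1] + (int)flags[n - 1];
    thrust::device_vector<int> out(count);
    // Scatter each matching record to its compacted position.
    thrust::scatter_if(records.begin(), records.end(),
                       positions.begin(), flags.begin(), out.begin());
    return out;
}
```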
4) DNN Kernels:
All benchmarks in this section represent artificial neural network layers commonly seen in popular DNN models. All layers in this section include both forward and backward passes.

Activation layer is used to decide whether a neuron should be activated by calculating the weighted sum and adding a bias to it. It introduces non-linearity into the output of a neuron. Some of the most commonly used activation functions include ReLU, sigmoid, tanh, and LeakyReLU. Here we only present ReLU since it is the simplest one to understand. The following describes the ReLU activation function, where x_i is the input to the neuron and y_i is the output:

y_i = \max(0, x_i)  (1)

Pooling is commonly used between successive convolution layers in a ConvNet architecture. Its main function is to reduce the spatial dimensions of a convolutional neural network. For example, applying a max-pool kernel of size 2 × 2 with stride 2 halves each spatial dimension.

Batch normalization is a technique proposed to address covariate shift [20] in DNNs. When parameters in the preceding layer change, the input to the current layer will change accordingly, causing the current layer to adjust to the new distribution. The main goal of batch normalization is to limit the shifting to a certain range to speed up the training process and produce reliable models.
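As an illustration of how such layers are driven through cuDNN, the sketch below applies a forward ReLU activation over an NCHW tensor. The descriptors and sizes are illustrative; Mirovia's actual layer configurations may differ.

```cuda
#include <cudnn.h>

// Applies y = max(0, x) element-wise over an n x c x h x w tensor with cuDNN.
void reluForward(cudnnHandle_t handle, const float *d_x, float *d_y,
                 int n, int c, int h, int w) {
    cudnnTensorDescriptor_t desc;
    cudnnCreateTensorDescriptor(&desc);
    cudnnSetTensor4dDescriptor(desc, CUDNN_TENSOR_NCHW, CUDNN_DATA_FLOAT,
                               n, c, h, w);

    cudnnActivationDescriptor_t act;
    cudnnCreateActivationDescriptor(&act);
    cudnnSetActivationDescriptor(act, CUDNN_ACTIVATION_RELU,
                                 CUDNN_NOT_PROPAGATE_NAN, 0.0);

    const float alpha = 1.0f, beta = 0.0f;
    cudnnActivationForward(handle, act, &alpha, desc, d_x, &beta, desc, d_y);

    cudnnDestroyActivationDescriptor(act);
    cudnnDestroyTensorDescriptor(desc);
}
```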
Connected layers are those whose neurons are connected to every neuron in the next layer. The connected layer can be seen as a feature vector that holds aggregated information from the previous layer. For example, a connected layer can come right after a convolution layer which provides a low-dimensional, invariant feature space. The fully connected layer can then learn a function from that space to produce more useful or abstract knowledge.
Convolution layer is mostly used to extract important features from images by assigning learnable weights to various objects in those images. For example, given an RGB image of size 228 × 228 with 3 channels, we can train a convolution kernel of size 3 × 3 × 3 that produces an output of size 226 × 226 with 1 channel. The output tensor represents one feature in the image, such as the presence of curves in different parts of the input image.
Dropout is a regularization technique used to prevent neural networks from overfitting [21]. The key idea is to randomly drop units from the neural network during training. When training large neural networks on small data sets, overfitting can be a huge issue when the model is evaluated on the test data set. Dropout addresses this problem by stochastically introducing noise to prevent units from co-adapting too much, thus making the model more robust.
RNN stands for Recurrent Neural Network. It is widely adopted in learning tasks dealing with sequential data, such as speech recognition, text generation, and so on. RNNs have proven successful in capturing the dynamics of sequences by keeping internal states (memory) that track information from previous time steps. Among the most commonly used RNNs are GRU and LSTM. In our benchmark, we only show results for LSTM for simplicity.
Softmax layer is typically seen as the final output layer in a neural network performing multi-class classification. It takes an input, usually a score value z_i, i = 1 ... K, and recomputes it as probabilities. Therefore, the outputs of the layer represent a true probability distribution, where the individual outputs sum to 1. Its calculation is shown below:

\sigma(z_c) = \frac{e^{z_c}}{\sum_{k=1}^{K} e^{z_k}}  (2)

LRN (Local Response Normalization) is intended to simulate a form of lateral inhibition [22] inspired by the type found in real neurons. It damps responses that are uniformly large within a neighborhood and creates higher contrast in the activation map. This is especially useful with unbounded activation functions such as ReLU. The original formula is written as

b^i_{x,y} = a^i_{x,y} \Big/ \left( k + \alpha \sum_{j=\max(0,\, i-n/2)}^{\min(N-1,\, i+n/2)} \left(a^j_{x,y}\right)^2 \right)^{\beta}  (3)

where b^i_{x,y} is the regularized output for kernel i at position x, y; a^i_{x,y} is the source output of kernel i applied at position x, y; N is the number of kernels; n is the size of the normalization neighbourhood; and \alpha, \beta, k are hyperparameters of LRN.

V. EVALUATION
In this section, we evaluate the applications in Mirovia in terms of runtime characteristics, diversity, and performance. Our tests were performed on a machine with the following specifications:
• Ubuntu 18.04.2 LTS
• Linux 4.15.0-48-generic
• CPU: 2 × Intel(R) Xeon(R) CPU E5-2650 v4
  – 12 cores, 24 threads
  – 30M L3 cache
• GPU: NVIDIA Tesla P100-SXM2
  – Driver: 418.40.04

A. Benchmark Performance
We use the nvprof profiling tool to collect the metrics gathered from running individual kernels. Note that a number of benchmarks involve multiple kernels and some are redundant. We select the maximum utilization of each kernel and calculate the mean and corresponding standard deviation. The memory and computational unit utilization rates are shown in Figure 3 and Figure 4. The utilization of different GPU components shows a diverse set of behaviors for both forward and backward passes in the DNN section of the benchmark.
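For reference, the utilization values reported throughout this section correspond to nvprof's per-kernel utilization counters (reported on a 0-10 scale). A collection command of roughly the following form can be used; the exact metric set and options in Mirovia's scripts may differ, and metric names vary slightly across GPU generations.

```sh
nvprof --metrics dram_utilization,l2_utilization,shared_utilization,\
tex_utilization,cf_fu_utilization,ldst_fu_utilization,\
single_precision_fu_utilization,double_precision_fu_utilization,\
special_fu_utilization ./benchmark <benchmark args>
```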
TABLE II
SELECTED CUDNN KERNELS OF DNN BENCHMARKS

Benchmark   | Forward Kernel                        | Backward Kernel
Activation  | op generic tensor kernel              | op generic tensor kernel
Pooling     | pooing fw 4d kernel                   | pooing bw kernel avg
Batchnorm   | bn fw tr 1C11 kernel NCHW             | bn bw 1C11 kernel new
Connected   | maxwell sgemm 128x64 tn               | sgemm 128x128x8 TN vec
Convolution | maxwell scudnn 128x128 relu small nn  | wgrad alg0 engine
Dropout     | dropout fp                            | dropout bp
RNN         | maxwell sgemm 128x64 tn               | maxwell sgemm 128x64 nn
Softmax     | softmax fw kernel resident            | softmax bw kernel
LRN         | lrnForward evenC                      | lrnBackward evenC
Fig. 3. Forward Kernel Utilization
Fig. 4. Backward Kernel Utilization
Fig. 5. Mirovia Utilization

We observe that the most heavily utilized components are DRAM and the single-precision floating-point unit, with the backward average-pooling kernel being an exception, showing high utilization of shared memory and the load/store unit.

Table II shows the most relevant kernel of each DNN benchmark. Each kernel's collected metrics are presented in the figure below. The utilization of the single-precision floating-point unit is closely related to the IPC for most kernels. For example, both the forward and backward convolution kernels achieve high IPC, while the low utilization of the single-precision unit results in low IPC for the batch normalization kernel. The number of eligible warps per cycle is also high for convolution and low for batch normalization. This can be explained by the fact that convolution has relatively good data locality and spends less time waiting for data dependencies to be met, whereas batch normalization requires more memory operations, which reduces the number of warps eligible to issue the next instruction. This indicates that the convolution operation is compute bound while batch normalization is memory bound.

[Figure: per-layer Utilization, Eligible Warps Per Cycle, and Achieved Occupancy for the forward and backward DNN kernels.]
The utilization of the remaining Mirovia benchmarks shows a diverse range of values. Each GPU component's utilization is higher compared to SHOC. This can be explained by the increase in input data size, which demonstrates the importance of having user-defined input problem sizes to stress hardware performance. These benchmarks also differ from the DNN kernels: DNN kernels tend to stress DRAM and the single-precision units heavily, while the conventional benchmarks exhibit a more diverse utilization of each component.
B. CUDA Feature Analysis
In this section, we analyze the benchmarks which implement new CUDA features to find out how each feature affects application performance. To do this, we show the speedup of applications using the feature over the various preset problem sizes available in Mirovia.
Unified Memory: For this feature, the kernel time plus the transfer time of BFS without unified memory was compared to the kernel time of BFS with unified memory, since there is no explicit transfer time when using unified memory. Three different versions of BFS using unified memory were tested, and each was compared to a version of BFS that doesn't utilize any new features. The first version uses unified memory without cudaMemAdvise() or cudaMemPrefetchAsync(). The second version uses unified memory with only cudaMemAdvise(), and the last version uses unified memory with both cudaMemAdvise() and cudaMemPrefetchAsync(). We found that BFS with unified memory was able to run faster than the baseline version only when prefetching was introduced. Additionally, the speedup was fairly inconsistent and did not scale with the input size. This is because the execution path is highly dependent on the generated graph. Since data is randomly generated, this introduces randomness into the speedup over the various problem sizes. The result is reasonable because constant demand paging introduces execution overheads.

HyperQ: HyperQ was added to the level 1 Pathfinder benchmark. Since this CUDA feature increases utilization when multiple independent kernels can execute concurrently, we simply ran multiple instances of Pathfinder on different streams. The graph shows the speedup as the number of concurrent Pathfinder kernels increases. The transfer time is not included because it would stay the same regardless. Our results show that the speedup gained from HyperQ increases as the number of parallel kernels scales up. The speedup levels out around 32 parallel instances, which makes sense, as at this point the benchmark is saturating all 32 independent work queues. In addition, we see speedup starting at a little under 1x for a single instance, and up to 4x thereafter. This follows as increasing the number of instances makes use of more of the work queues that would otherwise sit idle.
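A condensed sketch of the three unified-memory variants compared above is shown below; the allocation size and pointer names are illustrative, and the BFS kernel launch is only indicated by a comment.

```cuda
#include <cuda_runtime.h>

int main() {
    const size_t bytes = 1 << 28;   // placeholder graph size
    const int device = 0;
    int *edges = nullptr;

    // Variant 1: plain managed memory; pages migrate on demand (page faults).
    cudaMallocManaged(&edges, bytes);
    // ... build the graph in `edges` on the host ...

    // Variant 2: additionally tell the driver where the data should live.
    cudaMemAdvise(edges, bytes, cudaMemAdviseSetPreferredLocation, device);

    // Variant 3: additionally prefetch the data to the GPU before the launch,
    // avoiding demand paging during the traversal itself.
    cudaMemPrefetchAsync(edges, bytes, device);

    // bfsKernel<<<grid, block>>>(edges, ...);   // hypothetical kernel launch
    cudaDeviceSynchronize();
    cudaFree(edges);
    return 0;
}
```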
Cooperative Groups: Similar to HyperQ, the kernel time for SRAD using a cooperative kernel was compared to the kernel time of the original SRAD implementation. In this case, the transfer time was not included because it would have been the same for both. The biggest drawback of using cooperative groups is the limit on the number of blocks that can be launched. Because of this, SRAD with a cooperative kernel could not be run on image sizes greater than 256x256. Therefore, to get more data points, we varied the problem size by multiples of 16 instead of powers of 2.
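A minimal sketch of a grid-wide synchronization between two phases of a kernel is shown below (not the SRAD code). Cooperative launches require a device and driver that support them, and, as noted above, the total number of blocks must be able to reside on the device simultaneously.

```cuda
#include <cuda_runtime.h>
#include <cooperative_groups.h>
namespace cg = cooperative_groups;

__global__ void twoPhaseKernel(float *in, float *tmp, float *out, int n) {
    cg::grid_group grid = cg::this_grid();
    int i = (int)grid.thread_rank();

    // Phase 1: every thread writes an intermediate value.
    if (i < n) tmp[i] = in[i] * in[i];

    // All blocks must finish phase 1 before any thread starts phase 2.
    grid.sync();

    // Phase 2: safely reads phase-1 results produced by other blocks.
    if (i < n) out[i] = tmp[i] + tmp[n - 1 - i];
}

void launch(float *d_in, float *d_tmp, float *d_out, int n) {
    void *args[] = { &d_in, &d_tmp, &d_out, &n };
    dim3 block(256), grid((n + 255) / 256);
    // Cooperative kernels must be launched through this runtime API call.
    cudaLaunchCooperativeKernel((void *)twoPhaseKernel, grid, block, args, 0, 0);
}
```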
Dynamic Parallelism: For this feature, the speedup was measured using the kernel times for Mandelbrot with and without Dynamic Parallelism. Like most of the other features, transfer time was not included in the speedup because it was the same for both versions of the benchmark. This benchmark shows one of the cleanest increases in speedup as problem sizes increase. This primarily comes down to the efficiency of the two algorithms used and what Dynamic Parallelism allows. While the traditional Escape Time algorithm is forced to calculate values for every pixel, Mariani-Silver is allowed to subdivide and thus ignore ever increasing swaths of the image. This is shown by the increasing speedup as image size increases. While this benchmark's use of dynamic parallelism exercises regular parallelism, the feature can also be used for more varied implementations. The Mandelbrot kernel explicitly calls itself over and over on smaller patches of the image; other programs may choose to have a master kernel that calls multiple diverse sub-kernels. Mandelbrot is still enough to stress Dynamic Parallelism and is thus what we chose to include in Mirovia.
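The following is a schematic sketch of device-side kernel launches in the spirit of the refinement strategy described above; it is not the Mandelbrot kernel itself, and the kernel names and work-detection logic are placeholders. It must be compiled with relocatable device code (nvcc -rdc=true) on compute capability 3.5 or newer.

```cuda
#include <cuda_runtime.h>

// Child kernel: processes one segment of the array.
__global__ void childKernel(float *data, int offset, int len) {
    int i = offset + blockIdx.x * blockDim.x + threadIdx.x;
    if (i < offset + len) data[i] = sqrtf(data[i]);
}

// Parent kernel: each thread inspects its segment and only launches a child
// grid when there is actual work, analogous to how Mariani-Silver refinement
// skips regions that do not need per-pixel computation.
__global__ void parentKernel(float *data, const int *workPerSegment,
                             int segLen, int numSegments) {
    int s = blockIdx.x * blockDim.x + threadIdx.x;
    if (s >= numSegments) return;
    if (workPerSegment[s] > 0) {
        childKernel<<<(segLen + 127) / 128, 128>>>(data, s * segLen, segLen);
    }
}
```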
VI. CONCLUSION AND FUTURE WORK
In creating Mirovia we aimed to modernize aspects of popular existing suites such as Rodinia and SHOC. To do this, we improved the spread of benchmarks included, bringing in new programs from different domains, while also adapting problem sizes to the abilities of modern hardware. Most importantly, we added support for measuring the performance of many new features that were introduced in recent years. Features like Cooperative Groups, HyperQ, and half-precision arithmetic are all new enough that no current suite tests them at all. To capture the characteristics of DNN behavior on GPUs, we included a set of popular neural network kernels. In this way we present Mirovia as a more complete benchmark suite for the modern era. Plans for future work include:
• Add benchmark support for GPUDirect RDMA. This is a feature that allows direct data exchange between devices on a PCI bus. Utilizing this greatly reduces the time required for data transfers by bypassing typical memory copies across various data planes. This is especially useful for pipelined data processing workloads where there are multiple disjoint transformations acting on the same data. GPUDirect RDMA allows data to stay off the host the entire time, moving directly from stage to stage of the workload.
• Explore our benchmark diversity analysis by using the Principal Component Analysis (PCA) and Hierarchical Clustering Analysis described in [23].
• Incorporate new CUDA features such as CUDA graphs into our benchmarks to facilitate program speedup.
• Add new benchmarks to test new hardware features such as tensor cores [24] in more recent architectures. Tensor Cores are specialized hardware units designed for performing mixed precision matrix computations commonly used in deep learning neural network training and inference applications.
REFERENCES
[1] S. Che, M. Boyer, J. Meng, D. Tarjan, J. W. Sheaffer, S. Lee, and K. Skadron, "Rodinia: A benchmark suite for heterogeneous computing," pp. 44–54, Oct 2009.
[2] A. Danalis, G. Marin, C. McCurdy, J. S. Meredith, P. C. Roth, K. Spafford, V. Tipparaju, and J. S. Vetter, "The scalable heterogeneous computing (SHOC) benchmark suite," in Proceedings of the 3rd Workshop on General-Purpose Computation on Graphics Processing Units, GPGPU-3, (New York, NY, USA), pp. 63–74, ACM, 2010.
[3] M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado, A. Davis, J. Dean, M. Devin, S. Ghemawat, I. Goodfellow, A. Harp, G. Irving, M. Isard, Y. Jia, R. Jozefowicz, L. Kaiser, M. Kudlur, J. Levenberg, D. Mané, R. Monga, S. Moore, D. Murray, C. Olah, M. Schuster, J. Shlens, B. Steiner, I. Sutskever, K. Talwar, P. Tucker, V. Vanhoucke, V. Vasudevan, F. Viégas, O. Vinyals, P. Warden, M. Wattenberg, M. Wicke, Y. Yu, and X. Zheng, "TensorFlow: Large-scale machine learning on heterogeneous systems," 2015. Software available from tensorflow.org.
[4] A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer, "Automatic differentiation in PyTorch," NIPS-W, 2017.
[5] K. Asanović, R. Bodik, B. C. Catanzaro, J. J. Gebis, P. Husbands, K. Keutzer, D. A. Patterson, W. L. Plishker, J. Shalf, S. W. Williams, and K. A. Yelick, "The landscape of parallel computing research: A view from Berkeley," Tech. Rep. UCB/EECS-2006-183, EECS Department, University of California, Berkeley, Dec 2006.
[6] M. Kulkarni, M. Burtscher, C. Cascaval, and K. Pingali, "Lonestar: A suite of parallel irregular programs," (Los Alamitos, CA, USA), IEEE Computer Society, Apr 2009.
[7] S. Che, B. M. Beckmann, S. K. Reinhardt, and K. Skadron, "Pannotia: Understanding irregular GPGPU graph applications," pp. 185–195, Sep. 2013.
[8] M. A. O'Neil and M. Burtscher, "Microarchitectural performance characterization of irregular GPU kernels," pp. 130–139, Oct 2014.
[9] M. D. Sinclair, J. Alsop, and S. V. Adve, "HeteroSync: A benchmark suite for fine-grained synchronization on tightly coupled GPUs," pp. 239–249, Oct 2017.
[10] A. Jog, O. Kayiran, T. Kesten, A. Pattnaik, E. Bolotin, N. Chatterjee, S. W. Keckler, M. T. Kandemir, and C. R. Das, "Anatomy of GPU memory system for multi-application execution," in Proceedings of the 2015 International Symposium on Memory Systems, MEMSYS '15, (New York, NY, USA), pp. 223–234, ACM, 2015.
[11] H. Zhu, M. Akrout, B. Zheng, A. Pelegris, A. Jayarajan, A. Phanishayee, B. Schroeder, and G. Pekhimenko, "Benchmarking and analyzing deep neural network training," pp. 88–100, Sep. 2018.
[12] J. Lew, D. A. Shah, S. Pati, S. Cattell, M. Zhang, A. Sandhupatla, C. Ng, N. Goli, M. D. Sinclair, T. G. Rogers, and T. M. Aamodt, "Analyzing machine learning workloads using a detailed GPU simulator," pp. 151–152, March 2019.
[13] S. A. Mojumder, M. S. Louis, Y. Sun, A. K. Ziabari, J. L. Abellán, J. Kim, D. Kaeli, and A. Joshi, "Profiling DNN workloads on a Volta-based DGX-1 system," pp. 122–133, Sep. 2018.
[14] A. Karki, C. P. Keshava, S. M. Shivakumar, J. Skow, G. M. Hegde, and H. Jeon, "Tango: A deep neural network benchmark suite for various accelerators," Jan 2019.
[15] S. Dong and D. Kaeli, "DNNMark: A deep neural network benchmark suite for GPUs," in Proceedings of the General Purpose GPUs, GPGPU-10, (New York, NY, USA), pp. 63–72, ACM, 2017.
[16] N. P. Jouppi, C. Young, N. Patil, D. Patterson, G. Agrawal, R. Bajwa, S. Bates, S. Bhatia, N. Boden, A. Borchers, R. Boyle, P. Cantin, C. Chao, C. Clark, J. Coriell, M. Daley, M. Dau, J. Dean, B. Gelb, T. V. Ghaemmaghami, R. Gottipati, W. Gulland, R. Hagmann, C. R. Ho, D. Hogberg, J. Hu, R. Hundt, D. Hurt, J. Ibarz, A. Jaffey, A. Jaworski, A. Kaplan, H. Khaitan, D. Killebrew, A. Koch, N. Kumar, S. Lacy, J. Laudon, J. Law, D. Le, C. Leary, Z. Liu, K. Lucke, A. Lundin, G. MacKean, A. Maggiore, M. Mahony, K. Miller, R. Nagarajan, R. Narayanaswami, R. Ni, K. Nix, T. Norrie, M. Omernick, N. Penukonda, A. Phelps, J. Ross, M. Ross, A. Salek, E. Samadiani, C. Severn, G. Sizikov, M. Snelham, J. Souter, D. Steinberg, A. Swing, M. Tan, G. Thorson, B. Tian, H. Toma, E. Tuttle, V. Vasudevan, R. Walter, W. Wang, E. Wilcox, and D. H. Yoon, "In-datacenter performance analysis of a tensor processing unit," pp. 1–12, June 2017.
[17] B. Pham, V. Vaidyanathan, A. Jaleel, and A. Bhattacharjee, "CoLT: Coalesced large-reach TLBs," pp. 258–269, Dec 2012.
[18] J. Redmon, "Darknet: Open source neural networks in C." http://pjreddie.com/darknet/, 2013–2016.
[19] N. Satish, M. Harris, and M. Garland, "Designing efficient sorting algorithms for manycore GPUs," pp. 1–10, May 2009.
[20] S. Ioffe and C. Szegedy, "Batch normalization: Accelerating deep network training by reducing internal covariate shift," in ICML, 2015.
[21] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, "Dropout: A simple way to prevent neural networks from overfitting," Journal of Machine Learning Research, vol. 15, pp. 1929–1958, 2014.
[22] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "ImageNet classification with deep convolutional neural networks," in Advances in Neural Information Processing Systems 25 (F. Pereira, C. J. C. Burges, L. Bottou, and K. Q. Weinberger, eds.), pp. 1097–1105, Curran Associates, Inc., 2012.
[23] N. Goswami, R. Shankar, M. Joshi, and T. Li, "Exploring GPGPU workloads: Characterization methodology, analysis and microarchitecture evaluation implications," in IEEE International Symposium on Workload Characterization (IISWC'10), pp. 1–10, Dec 2010.
[24] S. Markidis, S. Wei Der Chien, E. Laure, I. Peng, and J. S. Vetter, "NVIDIA tensor core programmability, performance & precision," Mar 2018.