A Study of the Fundamental Performance Characteristics of GPUs and CPUs for Database Analytics (Extended Version)
Anil Shanbhag
Samuel Madden
Xiangyao Yu
MIT; University of Wisconsin-Madison
ABSTRACT
There has been a significant amount of excitement and recent work on GPU-based database systems. Previous work has claimed that these systems can perform orders of magnitude better than CPU-based database systems on analytical workloads such as those found in decision support and business intelligence applications. A hardware expert would view these claims with suspicion. Given the general notion that database operators are memory-bandwidth bound, one would expect the maximum gain to be roughly equal to the ratio of the memory bandwidth of GPU to that of CPU. In this paper, we adopt a model-based approach to understand when and why the performance gains of running queries on GPUs vs on CPUs vary from the bandwidth ratio (which is roughly 16× on modern hardware). We propose Crystal, a library of parallel routines that can be combined together to run full SQL queries on a GPU with minimal materialization overhead. We implement individual query operators to show that while the speedups for selection, projection, and sorts are near the bandwidth ratio, joins achieve less speedup due to differences in hardware capabilities. Interestingly, we show on a popular analytical workload that the full query performance gain from running on GPU exceeds the bandwidth ratio despite individual operators having speedup less than the bandwidth ratio, as a result of limitations of vectorizing chained operators on CPUs, resulting in a 25× speedup for GPUs over CPUs on the benchmark.

1 INTRODUCTION

In the past decade, special-purpose graphics processing units (GPUs) have evolved into general-purpose computing devices, with the advent of general-purpose parallel programming models such as CUDA [3] and OpenCL [7]. Because of GPUs' high compute power, they have seen significant adoption in deep learning and in high performance computing [4]. GPUs also have significant potential to accelerate memory-bound applications such as database systems.
GPUs utilize High-Bandwidth Memory (HBM), a new class of RAM that has significantly higher throughput compared to the traditional DDR RAM used with CPUs. A single modern GPU can have up to 32 GB of HBM capable of delivering up to 1.2 TBps of memory bandwidth and 14 TFlops of compute. In contrast, a single CPU can have hundreds of GB of memory with up to 100 GBps memory bandwidth and 1 TFlop of compute. This rise in memory capacity, coupled with the ability to equip a modern server with several GPUs (up to 20), means that it is possible to have hundreds of gigabytes of GPU memory on a modern server. This is sufficient for many analytical tasks; for example, one machine could host in GPU memory several weeks of sales data for a large online retailer (with, say, 100M sales per day and 100 bytes of data per sale), the on-time flight performance of all commercial airline flights in the last few decades, or the time, location, and (dictionary-encoded) hash tags used in each of the several billion tweets sent over the past few days.

In-memory analytics is typically memory-bandwidth bound. The improved memory bandwidth of GPUs has led some researchers to use GPUs as coprocessors for analytic query processing [15, 24, 44, 48]. However, previous work leaves several unanswered questions:

• GPU-based database systems have reported a wide range of performance improvements compared to CPU-based database systems, ranging from 2× to 100×. There is a lack of consensus on how much performance improvement can be obtained from using GPUs. Past work frequently compares against inefficient baselines, e.g., MonetDB [24, 44, 48], which is known to be inefficient [29]. The empirical nature of past work makes it hard to generalize results across hardware platforms.

• Past work generally views the GPU strictly as a coprocessor. Every query ends up shipping data from CPU to GPU over PCIe. Data transfer over PCIe is an order of magnitude slower than GPU memory bandwidth, and typically less than the CPU memory bandwidth.
As a result, the PCIe transfer time becomes the bottleneck and limits gains. To the extent that past work shows performance improvements using GPUs as a coprocessor, much of those gains may be due to evaluation against inefficient baselines.

• There has been significant improvement in GPU hardware in recent years. The most recent work on GPU-based databases [15] evaluates on GPUs which have memory capacity and bandwidth of 4 GB and 150 GBps respectively, while the latest generation of GPUs have 8× higher capacity and bandwidth. These gains significantly improve the attractiveness of GPUs for query processing.

In this paper, we set out to understand the true nature of the performance difference between CPUs and GPUs, by performing rigorous model-based and performance-based analysis of database analytics workloads after applying optimizations for both CPUs and GPUs. To ensure that our implementations are state-of-the-art, we use theoretical minimums derived assuming memory bandwidth is saturated as a baseline, and show that our implementations can typically saturate the memory bus, or, when they cannot, describe in detail why they fall short. Hence, although we offer some insights into the best implementations of different operators on CPUs and GPUs, the primary contribution of this paper is to serve as a guide to implementors as to what sorts of performance differences one should expect to observe in database implementations on modern versions of these different architectures.

Past work has used GPUs mainly as coprocessors. By comparing an efficient CPU implementation of a query processor versus an implementation that uses the GPU as a coprocessor, we show that GPU-as-coprocessor offers little to no gain over a pure CPU implementation, performing worse than the CPU version for some queries.
We argue that the right setting is having the working set stored directly on GPU(s).

We developed models and implementations of basic operators (select, project, and join) on both CPU and GPU to understand when the ratio of operator runtime on CPUs to runtime on GPUs deviates from the ratio of memory bandwidth of GPU to memory bandwidth of CPU. In the process, we noticed that the large degree of parallelism of GPUs leads to additional materialization. We propose a novel execution model for query processing on GPUs called the tile-based execution model. Instead of looking at GPU threads in isolation, we treat a block of threads ("thread block") as a single execution unit, with each thread block processing a tile of items. The benefit of this tile-based execution model is that thread blocks can now cache tiles in shared memory and collectively process them. This helps avoid additional materialization. This model can be expressed using a set of primitives where each primitive is a function which takes as input a set of tiles and outputs a set of tiles. We call these primitives block-wide functions. We present Crystal, a library of block-wide functions that can be used to implement the common SQL operators as well as full SQL queries. Furthermore, we use Crystal to implement the query operators on the GPU and compare their performance against equivalent state-of-the-art implementations on the CPU. We use Crystal to implement the Star Schema Benchmark (SSB) [30] on the GPU and compare its performance against our own CPU implementation, a state-of-the-art CPU-based OLAP DBMS, and a state-of-the-art GPU-based OLAP DBMS. In both cases, we develop models assuming memory bandwidth is saturated and reason about the performance based on them.

In summary, we make the following contributions:

• We show that previous designs which use the GPU as a coprocessor show no performance gain when compared against a state-of-the-art CPU baseline. Instead, using modern GPUs' increased memory capacity to store the working set directly on the GPU is a better design.

• We present Crystal, a library of data processing primitives that can be composed together to generate efficient query code that can take full advantage of GPU resources.

• We present efficient implementations of individual operators for both GPU and CPU. For each operator, we provide cost models that can accurately predict their performance.

• We describe our implementation of SSB and evaluate both GPU and CPU implementations of it. We present cost models that can accurately predict query runtimes on the GPU and discuss why such models fall short on the CPU.
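The bandwidth-ratio argument that frames this study can be made concrete with a back-of-envelope calculation. The sketch below (in Python, using the ballpark hardware figures quoted in this section rather than measured numbers) estimates scan time as bytes moved divided by memory bandwidth:

```python
# Back-of-envelope model of a memory-bound column scan.
# Bandwidth numbers are the ballpark figures quoted in the text
# (~1.2 TBps HBM for a modern GPU, ~100 GBps DDR for a CPU), not measurements.
GPU_BW_GBPS = 1200
CPU_BW_GBPS = 100

def scan_time_ms(bytes_scanned, bandwidth_gbps):
    """Lower bound on scan runtime: bytes moved / memory bandwidth."""
    return bytes_scanned / (bandwidth_gbps * 1e9) * 1e3

# One 4-byte integer column with 2**28 entries (a ~1 GB working set).
col_bytes = 4 * 2**28
cpu_ms = scan_time_ms(col_bytes, CPU_BW_GBPS)
gpu_ms = scan_time_ms(col_bytes, GPU_BW_GBPS)
print(f"CPU: {cpu_ms:.2f} ms  GPU: {gpu_ms:.2f} ms  ratio: {cpu_ms / gpu_ms:.0f}x")
```

Under this model, the best-case gain for a memory-bound operator is exactly the bandwidth ratio (12× with these figures); the operator and query studies in this paper examine when real speedups fall below this bound, or, for full queries, above it.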
2 BACKGROUND

In this section, we review the basics of the GPU architecture and describe relevant aspects of past approaches to running database analytics workloads on CPU and GPU.
Many database operations executed on the GPU are performance bound by the memory subsystem (either shared or global memory) [48]. In order to characterize the performance of different algorithms on the GPU, it is thus critical to properly understand its memory hierarchy.

Figure 1 shows a simplified hierarchy of a modern GPU. The lowest and largest memory in the hierarchy is the global memory. A modern GPU can have global memory capacity of up to 32 GB with memory bandwidth of up to 1200 GBps. Each GPU has a number of compute units called Streaming Multiprocessors (SMs). Each SM has a number of cores and a fixed set of registers. Each SM also has a shared memory which serves as a scratchpad that is controlled by the programmer and can be accessed by all the cores in the SM. Accesses to global memory from an SM are cached in the L2 cache (the L2 cache is shared across all SMs) and optionally also in the L1 cache (the L1 cache is local to each SM).

Figure 1: GPU Memory Hierarchy

Processing on the GPU is done by a large number of threads organized into thread blocks (each run by one SM). Thread blocks are further divided into groups of threads called warps (usually consisting of 32 threads). The threads of a warp execute in a Single Instruction Multiple Threads (SIMT) model, where each thread executes the same instruction stream on different data. The device groups global memory loads and stores from threads in a single warp such that multiple loads/stores to the same cache line are combined into a single request. Maximum bandwidth can be achieved when a warp's access to global memory results in neighboring locations being accessed.

The programming model allows users to explicitly allocate global memory and shared memory in each thread block. Shared memory has an order of magnitude higher bandwidth than global memory but has much smaller capacity (a few MB vs. multiple GB).

Finally, registers are the fastest layer of the memory hierarchy. If a thread block needs more registers than available, register values spill over to global memory.
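The payoff from coalesced access can be seen with a simple count of cache lines touched. The sketch below assumes 128-byte cache lines and a warp of 32 threads issuing 4-byte loads; the exact line size is device-specific, so treat the numbers as illustrative:

```python
# Count distinct cache lines one warp touches for a single load instruction.
# Assumes 128-byte lines; 32 threads each load one 4-byte element.
WARP_SIZE = 32
LINE_BYTES = 128
ELEM_BYTES = 4

def lines_touched(byte_addresses):
    """Distinct cache lines covered by the warp's load addresses."""
    return len({addr // LINE_BYTES for addr in byte_addresses})

coalesced = [t * ELEM_BYTES for t in range(WARP_SIZE)]       # neighboring elements
strided = [t * 1024 * ELEM_BYTES for t in range(WARP_SIZE)]  # elements 4 KB apart
print(lines_touched(coalesced), lines_touched(strided))  # 1 vs 32 transactions
```

A coalesced warp access is serviced by a single memory transaction, while the scattered pattern needs one transaction per thread, which is why the access patterns of the kernels in this paper are designed around coalescing.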
With the slowing of Moore's Law, CPU performance has stagnated. In recent years, researchers have started exploring heterogeneous computing to overcome the scaling problems of CPUs and to continue to deliver interactive performance for database applications. In such a hybrid CPU-GPU system, the two processors are connected via PCIe. The PCIe bandwidth of a modern machine is up to 16 GBps, which is much lower than the memory bandwidth of either CPU or GPU. Therefore, data transfer between CPU and GPU is a serious performance bottleneck.

Past work in the database community has focused on using the GPU as a coprocessor, which we call the coprocessor model. In this model, data primarily resides in the CPU's main memory. For query execution, data is shipped from the CPU to the GPU over PCIe, so that (some) query processing can happen on the GPU. Results are then shipped back to the CPU. Researchers have worked on optimizing various database operations under the coprocessor model: selection [40], join [18, 19, 22, 35, 38, 39, 47], and sort [16, 42]. Several full-fledged GPU-as-coprocessor database query engines have been proposed in recent years. Ocelot [20] provides a hybrid analytical engine as an extension to MonetDB. YDB [48] is a GPU-based data warehousing engine. Both systems used an operator-at-a-time model, where an operator library containing GPU kernel implementations of common database operators such as scans and joins is invoked on batches of tuples, running each operator to completion before moving on to the next operator. Kernel fusion [46] attempted to hide the inefficiency associated with running multiple kernels for each query as in the operator-at-a-time model. Kernel fusion fused operator kernels with producer-consumer dependencies when possible to eliminate redundant data movement. As kernel fusion is applied as a post-processing step, it misses opportunities where kernel configurations are incompatible (like the one described in Section 3.2).
HippogriffDB [24] used GPUs for large-scale data warehousing where data resides on SSDs. HippogriffDB claims to achieve 100× speedup over MonetDB when the ratio of memory bandwidth of GPU to CPU is roughly 5×. We have not been able to get the source code to compare against the system. More recently, HorseQC [15] proposes pipelined data transfer between CPU and GPU to improve query runtime. As we show in the next section, using HorseQC ends up being slower than running the query efficiently directly on the CPU.

Commercial systems like Omnisci [6], Kinetica [5], and BlazingDB [2] aim to provide real-time analytical capabilities by using GPUs to store large parts (or all) of the working set. The setting used in this paper is similar to the ones used by these systems. Although these systems use a design similar to what we advocate, some have claimed 1000× performance improvement by using GPUs [1] but have not published rigorous benchmarks against state-of-the-art CPU or GPU databases, which is the primary aim of this paper.

Database operators have been extensively optimized for modern processors. For joins, researchers have proposed using cache-conscious partitioning to improve hash join performance [9-11, 25]. Schuh et al. summarized the approaches [37]. For sort, Satish et al. [36] and Wassenberg et al. [45] introduced buffered partitioning for radix sort. Polychroniou et al. [32] presented faster variants of radix sort that use SIMD instructions. Sompolski et al. [41] showed that a combination of vectorization and compilation can improve the performance of project, selection, and hash join operators. Polychroniou et al. [31] presented efficient vectorized designs for selections, hash tables, and partitioning using SIMD gathers and scatters. Prashanth et al. [27] extended the idea to generate machine code for full queries with SIMD operators. We use ideas from these works, mainly the works of Polychroniou et al. [31, 32], for our CPU implementations.

C-Store [43] and MonetDB [12] were among the first column-oriented engines, which formed the basis for analytical query processing. MonetDB X100 [13] introduced the idea of vectorized execution that was cache-aware and reduced memory traffic. Hyper [29] introduced push-based iteration and compiling queries into machine code using LLVM. Hyper was significantly faster than MonetDB and brought query performance close to that of handwritten C code. We compare the performance of our CPU query implementations against MonetDB [12] and Hyper [29].
SELECT SUM(lo_extendedprice * lo_discount) AS revenue
FROM lineorder
WHERE lo_quantity < 25
  AND lo_orderdate >= 19930101 AND lo_orderdate <= 19940101
  AND lo_discount >= 1 AND lo_discount <= 3;
Figure 2: Star Schema Benchmark Q1.1
Figure 3: Evaluation on the Star Schema Benchmark (runtime in ms of MonetDB, GPU Coprocessor, and Hyper)
3 TILE-BASED EXECUTION MODEL

In this section, we describe the tile-based execution model we use to execute queries on the GPU efficiently. We begin by showing why the coprocessor model used by past work is a suboptimal design and motivate why storing the working set directly on the GPU in a heterogeneous system (as done by all commercial systems) is a better approach. Through an example, we illustrate the unique challenges associated with running queries in a massively parallel manner on the GPU. We show how treating the thread block as the basic execution unit, with each thread block processing a tile of items (similar to vector-based processing on the CPU, where each thread processes a vector of items at a time), leads to good performance on the GPU. We call this approach the tile-based execution model. Finally, we show how this model can be expressed using a set of primitives where each primitive is a function which takes as input a set of tiles and outputs a set of tiles. We call these primitives block-wide functions. We present Crystal, a library of block-wide functions that can be composed to create a full SQL query.
While past work has claimed speedups from using GPUs in the coprocessor model, there is no consensus among past work about the performance improvement obtained from using GPUs, with reported improvements varying from 2× to 100×.

Consider Q1.1 from the Star Schema Benchmark (SSB) shown in Figure 2. For simplicity, assume all column entries are 4-byte integers and L is the number of entries in lineorder. An efficient implementation on a CPU will be able to generate the result using a single pass over the 4 data columns. The optimal CPU runtime (R_C) is upper bounded by 16L/B_c, where B_c is the CPU memory bandwidth. This is an upper bound because, if the predicates are selective, then we may be able to skip entire cache lines while accessing the lo_extendedprice column. In the coprocessor model, we have to ship 4 columns of data to the GPU. Thus, the query runtime on the GPU (R_G) is lower bounded by 16L/B_p, where B_p is the PCIe bandwidth. The bound is hit if we are able to perfectly overlap the data transfer and query execution on the GPU. However, since B_c > B_p in modern CPU-GPU setups, R_C < R_G, i.e., running the query on the CPU yields a lower runtime than running the query with a GPU coprocessor.

To show this empirically, we ran the entire SSB with scale factor 20 on an instance where CPU memory bandwidth is 54 GBps, GPU memory bandwidth is 880 GBps, and PCIe bandwidth is 12.8 GBps. The full workload details can be found in Section 5.1 and the full system details can be found in Table 2. We compare the performance of the GPU Coprocessor with two OLAP DBMSs: MonetDB and Hyper. Past work on using GPUs as a coprocessor mostly compared their performance against MonetDB [24, 44, 48], which is known to be inefficient [29]. Figure 3 shows the results. On average, GPU Coprocessor performs 1.× faster than MonetDB but is 1.× slower than Hyper. For all queries, the query runtime in GPU Coprocessor is bound by the PCIe transfer time. We conclude that the reason past work was able to show performance improvement with a GPU coprocessor is that their optimized implementations were compared against inefficient baselines (e.g., MonetDB) on the CPU.

With the significant increase in GPU memory capacity, a natural question is how much faster a system that treats the GPU as the primary execution engine, rather than as an accelerator, can be. We describe our architecture for such a system in the rest of this section.
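The argument above reduces to a two-line cost model. In the sketch below (Python), B_c and B_p are the bandwidths quoted for the test machine; the lineorder cardinality at scale factor 20 is approximated as 120 million rows (6 million rows per scale-factor unit), which is an assumption for illustration:

```python
# Cost model for SSB Q1.1: both sides must move the same 16L bytes
# (4 four-byte columns), the CPU over its memory bus, the coprocessor
# over PCIe. Bandwidths below are the test machine's figures from the text.
B_c = 54e9    # CPU memory bandwidth, bytes/sec
B_p = 12.8e9  # PCIe bandwidth, bytes/sec

def cpu_upper_bound_ms(L):
    """R_C <= 16L / B_c: one pass over four int columns."""
    return 16 * L / B_c * 1e3

def coproc_lower_bound_ms(L):
    """R_G >= 16L / B_p: PCIe transfer alone, assuming perfect overlap."""
    return 16 * L / B_p * 1e3

L = 120_000_000  # approx. lineorder rows at scale factor 20 (assumption)
print(f"CPU bound: {cpu_upper_bound_ms(L):.0f} ms  "
      f"coprocessor bound: {coproc_lower_bound_ms(L):.0f} ms")
# Because B_c > B_p, the CPU's upper bound is below the coprocessor's
# lower bound for every L, so the coprocessor can never win in this model.
```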
While a modern CPU can have dozens of cores, a modern GPU like the Nvidia V100 can have 5000 cores. The vast increase in parallelism introduces some unique challenges for data processing. To illustrate this, consider running the following simple selection query as a micro-benchmark on both a CPU and a GPU:
Q0 : SELECT y FROM R WHERE y > v;
On the CPU, the query can be efficiently executed as follows. The data is partitioned equally among the cores. The goal is to write the results in parallel into a contiguous output array. The system maintains a global atomic counter that acts as a cursor that tells each thread where to write the next result. Each core processes its partition by iterating over the entries in the partition one vector of entries at a time, where a vector is about 1000 entries (small enough to fit in the L1 cache). Each core makes a first pass over the vector of entries to count the number of entries that match the predicate, d. The thread increments the global counter by d to allocate output space for the matching records, and then does a second pass over the vector to copy the matched entries into the output array in the allocated range of the output. Since the second pass reads data from the L1 cache, the read is essentially free. The global atomic counter is a potential point of contention. However, note that each thread updates the counter once for every 1000+ entries and there are only around 32 threads running in parallel at any point. The counter ends up not being the bottleneck and the total runtime is approximately D/B_C + Dσ/B_C, where D is the size of the column, σ is the selectivity of the predicate, and B_C is the memory bandwidth of the CPU.

We could run the same plan on the GPU, partitioning the data up among the thousands of threads. However, GPU threads have significantly fewer resources per thread. On the Nvidia V100, each GPU thread can only store roughly 24 4-byte entries in shared memory at full occupancy, with 5000 threads running in parallel. Here, the global atomic counter ends up becoming the bottleneck as all the threads will attempt to increment the counter to find the offset into the output array. To work around this, existing GPU-based database systems would execute this query in 3 steps as shown in Figure 4(a). The first kernel K1 would be launched across a large number of threads.
In it, each thread would read in column entries in a strided fashion (interleaved by thread number) and evaluate the predicate to count the number of entries matched. After processing all elements, the total number of entries matched per thread would be recorded in an array count, where count[t] is the number of entries matched by thread t. The second kernel K2 would use the count array to compute the prefix sum of the count and store this in another array pf. Recall that for an array A of k elements, the (exclusive) prefix sum p_A is a k-element array where p_A[j] = Σ_{i=0}^{j-1} A[i]. Thus, the i-th entry in pf indicates the offset at which the i-th thread should write its matched results to in the output array o. Databases used an optimized routine from a CUDA library like Thrust [8] to run it efficiently in parallel. The third kernel K3 would then read in the input column again; here the i-th thread again scans the i-th stride of the input, using pf[i] to determine where to write the satisfying records. Each thread also maintains a local counter c_i, initially set to 0. Specifically, for each satisfying entry, thread i writes it to o[pf[i] + c_i] and then increments c_i. In the end, o[pf[t]] ... o[pf[t+1] - 1] will contain the matched entries of thread t.

The above approach shifts the task of finding offsets into the output array to an optimized prefix sum kernel whose runtime is a function of T (where T is the number of threads, T << n), instead of finding them inline using atomic updates to a counter. As a result, the approach ends up being significantly faster than the naive translation of the CPU approach to the GPU. However, there are a number of issues with this approach.
First, it reads the input column from global memory twice, compared to doing it just once with the CPU approach. It also reads/writes to intermediate structures count and pf. Finally, each thread writes to a different location in the output array, resulting in random writes. To address these issues, we introduce the tile-based execution model.

Figure 4: Running selection on GPU
Figure 5: Vector-based to Tile-based execution models

Tile-based processing extends the vector-based processing on the CPU, where each thread processes a vector at a time, to the GPU. Figure 5 illustrates the model. As discussed earlier in Section 2.1, threads on the GPU are grouped into thread blocks. Threads within a thread block can communicate through shared memory and can synchronize through barriers. Hence, even though a single thread on the GPU at full occupancy can hold only up to 24 integers in shared memory, a single thread block can hold a significantly larger group of elements collectively between them in shared memory. We call this unit a
Tile. In the tile-based execution model, instead of viewing each thread as an independent execution unit, we view a thread block as the basic execution unit, with each thread block processing a tile of entries at a time. One key advantage of this approach is that after a tile is loaded into shared memory, subsequent passes over the tile read directly from shared memory and not from global memory, avoiding the second pass through global memory described in the implementation above.

Figure 4(b) shows how selection is implemented using the tile-based model. The entire query is implemented as a single kernel instead of three. Figure 6 shows a sample execution with a tile of size 16 and a thread block of 4 threads for the predicate y > 5. Note that this is just for illustration, as most modern GPUs would use a thread block size that is a multiple of 32 (the warp size) and the number of elements loaded would be 4-16 times the size of the thread block.

Figure 6: Query Q0 kernel running y > 5

We start by initializing the global counter to 0. The kernel loads a tile of items from global memory into shared memory. The threads then apply the predicate on all the items in parallel to generate a bitmap. For example, thread 0 evaluates the predicate for elements 0, 4, 8, 12 (shown in red). Each thread then counts the number of entries matched per thread to generate a histogram. The thread block cooperatively computes the prefix sum over the histogram to find the offset each thread writes to in shared memory. In the example, threads 0, 1, 2, 3 match 2, 1, 4, 3 entries respectively. The prefix sum entries 0, 2, 3, 7 tell us that thread 0 should write its matched entries to the output starting at index 0, thread 1 should write starting at index 2, etc. We increment a global counter atomically by the total number of matched entries to find the offset at which the thread block should write in the output array. The shuffle step uses the bitmap and the prefix sum to create a contiguous array of matched entries in shared memory. The final write step copies the contiguous entries from shared memory to global memory at the right offset.

By treating the thread block as an execution unit, we reduce the number of atomic updates of the global counter by a factor of the tile size T. The kernel also makes a single pass over the input column, with the Gen Shuffled Tile step ensuring that the final write to the output array is coalesced, solving both problems associated with the approach used in previous GPU databases.

The general concept of the tile-based execution model, i.e., dividing data into tiles and mapping thread blocks to tiles, has been used in other domains like image processing [21] and high performance computing [8].
However, to the best of our knowledge this is the first work that uses it for database operations. In the next section, we present Crystal, a library of data processing primitives that can be composed together to implement SQL queries on the GPU.
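To make the tile-based flow of Figure 4(b) concrete, the sketch below walks one tile through the load, predicate, histogram, prefix-sum, shuffle, and write steps. It is sequential Python standing in for what a thread block does cooperatively in CUDA; the sizes are toy values and the function structure is ours, not Crystal's API:

```python
# One thread block processing one tile under the tile-based model:
# each simulated thread t owns the strided elements t, t+T, t+2T, ...
def process_tile(tile, num_threads, pred, out, global_counter):
    bitmap = [pred(x) for x in tile]                      # predicate step
    hist = [sum(bitmap[t::num_threads]) for t in range(num_threads)]
    prefix, total = [], 0                                 # exclusive prefix sum
    for h in hist:
        prefix.append(total)
        total += h
    # Shuffle: scatter each thread's matches into its reserved slice,
    # yielding a contiguous array of matched entries (in "shared memory").
    shuffled = [None] * total
    for t in range(num_threads):
        c = 0
        for x in tile[t::num_threads]:
            if pred(x):
                shuffled[prefix[t] + c] = x
                c += 1
    offset = global_counter[0]        # ONE atomic update per tile,
    global_counter[0] += total        # not one per matching element
    out[offset:offset + total] = shuffled  # coalesced write to global memory
    return total

# Tile of 16 elements, 4 threads, predicate y > 5 (as in Figure 6).
out, counter = [], [0]
matched = process_tile(list(range(16)), 4, lambda y: y > 5, out, counter)
print(matched, out)  # 10 matched entries, contiguous in the output
```

Note how the global counter is touched once per tile rather than once per matched element, and how the matched entries leave the "block" already contiguous, which is what makes the final global-memory write coalesced.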
The kernel structure in Figure 6 contains a series of steps where each is a function that takes as input a set of tiles and outputs a set of tiles. We call these primitives block-wide functions. A block-wide function is a device function (a function that can be called from kernels on the GPU) that takes in a set of tiles as input, performs a specific task, and outputs a set of tiles. Instead of reimplementing these block-wide functions for each query, which would involve repetition of non-trivial functions, we developed a library called Crystal. The source code of the Crystal library is available at https://github.com/anilshanbhag/crystal.

Crystal is a library of templated CUDA device functions that implement the full set of primitives necessary for executing typical analytic SQL SPJA queries. Figure 7(a) shows a sketch of the simple selection query implemented using block-wide functions. Figure 8 shows the query kernel of the same query implemented with Crystal.

Figure 7: Implementing queries using Crystal
Figure 8: Selection query kernel (SELECT y FROM R WHERE y > v) implemented with Crystal
Primitive: Description
BlockLoad: Copies a tile of items from global memory to shared memory. Uses vector instructions to load full tiles.
BlockLoadSel: Selectively loads a tile of items from global memory to shared memory based on a bitmap.
BlockStore: Copies a tile of items in shared memory to device memory.
BlockPred: Applies a predicate to a tile of items and stores the result in a bitmap array.
BlockScan: Cooperatively computes the prefix sum across the block. Also returns the sum of all entries.
BlockShuffle: Uses the thread offsets along with a bitmap to locally rearrange a tile to create a contiguous array of matched entries.
BlockLookup: Returns matching entries from a hash table for a tile of keys.
BlockAggregate: Uses hierarchical reduction to compute the local aggregate for a tile of items.

Table 1: List of block-wide functions
We use the query kernel in Figure 8 to illustrate the key features of Crystal. The input tile is loaded from global memory into the thread block using BlockLoad. BlockLoad internally uses vector instructions when loading a full tile; the tail of the input array that may not form a perfect tile is loaded in a striped fashion element-at-a-time. BlockPred applies the predicate to generate the bitmap. A key optimization in Crystal is that, instead of storing the tile in shared memory, in cases where the array indices are statically known beforehand, we choose to use registers to store the values. In this case, items (which contains entries loaded from the column) and bitmap are stored in registers. Hence, in addition to the 24 4-byte values that a thread can store in shared memory, this technique allows us to use a roughly equal amount of available registers to store data items. Next we use BlockScan to compute the prefix sum. BlockScan internally implements a hierarchical block-wide parallel prefix-sum approach [17]. This involves threads accessing bitmap entries of other threads; for this we load bitmap into shared memory, reusing the buffer.col shared memory buffer used for loading the input column. Shared memory is an order of magnitude faster than global memory, hence loads and stores to shared memory in this case do not impact performance. After the atomic update to find the offset in the output array, BlockShuffle is used to reorder the array, and finally we use BlockStore to write to the output array. The code skips some minor details: for example, since the atomic update is executed on thread 0, the global offset needs to be communicated back to the other threads through shared memory.

In addition to allowing users to write high-performance kernel code that, as we show later, can saturate memory bandwidth, there are two usability advantages of using
Crystal:

• Modularity: Block-wide functions in Crystal make it easy to use non-trivial functions and reduce boilerplate code. For example, BlockScan, BlockLoad, and BlockAggregate each encapsulate tens to hundreds of lines of code. For the selection query example, Crystal reduces the lines of code from more than 300 to less than 30.

• Extensibility: Block-wide functions make it fairly easy to implement query kernels of larger queries. Figure 7(b) shows the implementation of a selection query with two predicates. Ordinary CUDA code can be used along with Crystal functions.
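The selection pipeline walked through above (BlockLoad → BlockPred → BlockScan → BlockShuffle → BlockStore) can be emulated sequentially. The following Python sketch is illustrative only: it mirrors Crystal's primitive names but is not the CUDA implementation, and it processes one tile at a time instead of one tile per thread block.

```python
# Illustrative emulation of one thread block's tile pipeline:
# BlockLoad -> BlockPred -> BlockScan -> (atomic offset) -> BlockShuffle/BlockStore.
# Names mirror Crystal's primitives; this is not the CUDA code.

def block_load(column, tile_offset, tile_size):
    """Load one tile (possibly a partial tail tile) from the column."""
    return column[tile_offset:tile_offset + tile_size]

def block_pred(items, pred):
    """Evaluate the predicate, producing a 0/1 bitmap."""
    return [1 if pred(v) else 0 for v in items]

def block_scan(bitmap):
    """Exclusive prefix sum over the bitmap; also return total matches."""
    offsets, total = [], 0
    for b in bitmap:
        offsets.append(total)
        total += b
    return offsets, total

def select_tile(column, tile_offset, tile_size, pred, output, global_count):
    """Process one tile: matched entries are written contiguously to output."""
    items = block_load(column, tile_offset, tile_size)
    bitmap = block_pred(items, pred)
    offsets, total = block_scan(bitmap)
    # One atomic update per tile reserves a contiguous output range
    # (done by thread 0 in the real kernel).
    base = global_count[0]
    global_count[0] += total
    # BlockShuffle + BlockStore: matched items land at base + prefix-sum offset.
    for v, b, off in zip(items, bitmap, offsets):
        if b:
            output[base + off] = v
    return total
```

Run over all tiles of a column, this produces the same compacted output as the GPU kernel, with one atomic update per tile rather than one per matched element.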
Crystal supports loading partial tiles like in Figure 7(b). If a selection or join filters entries, we use BlockLoadSel to load items that matched the previous selections based on a bitmap. In this case, the thread block internally allocates space for the entire tile; however, only matched entries are loaded from global memory. Table 1 briefly describes the block-wide functions currently implemented in the library.

To evaluate Crystal, we look at two microbenchmarks:

1) We evaluate the selection query Q0 with selectivity 0.5, varying the tile sizes. We vary the thread block size from 32 to 1024 in multiples of 2, and we have three choices for the number of items per thread: 1, 2, and 4. Figure 9 shows the results. As we increase the thread block size, the number of global atomic updates reduces and hence the runtime improves, until the thread block size approaches 512, after which it deteriorates. Each streaming multiprocessor on the GPU holds a maximum of 2048 threads; hence, having large thread blocks reduces the number of independent thread blocks. This affects utilization, particularly when thread blocks use synchronization heavily. Having 4 items per thread allows us to effectively load the entire block using vector instructions. With 2 items per thread, there is reduced benefit from vectorization, as half the lanes are empty; with 1 item per thread there is no benefit. The best performance is seen with a thread block size of 128/256 and 4 items per thread. These configurations saturate memory bandwidth, as we show later in Section 4.2, and hence achieve optimal performance.

2) We evaluate the selection query Q0 implemented with the independent-threads approach against the implementation using Crystal (Figure 4(b)). The number of entries in the input array is 2^29 and the selectivity is 0.5. The runtime with the independent-threads approach is 19ms, compared to just 2.1ms when using Crystal. Almost all of the performance improvement comes from avoiding atomic contention and being able to reorder matched entries to write in a coalesced manner.

Across all of the workloads we evaluated, we found that a thread block size of 128 with 4 items per thread is indeed the best-performing tile configuration. In the rest of the paper, we use this configuration for all implementations using Crystal. All implementations with Crystal are written in CUDA C++. Since Crystal's block-wide
Figure 9: Q0 performance with varying tile sizes

functions are standard device functions, they can also be called directly from LLVM IR.

In the next section, we show how to use these block-wide functions to build efficient operators on a GPU and compare their performance to equivalent CPU implementations.
In order to understand the true nature of the performance difference of queries on GPU vs. CPU, it is important to understand the performance difference of individual query operators. In this section, we compare the performance of basic operators: project, select, and hash join on GPU and CPU, with the goal of understanding how the ratio of runtime on GPU to runtime on CPU compares to the bandwidth ratio of the two devices. We use block-wide functions from Crystal to implement the operators on GPU and use equivalent state-of-the-art implementations on CPU. We also present a model for each of the operators assuming the operator saturates memory bandwidth, and show that in most cases the operators indeed achieve these limits. We use the model to explain the performance difference between CPU and GPU. For the micro-benchmarks, we use a setup where GPU memory bandwidth is 880GBps and CPU memory bandwidth is 54GBps, resulting in a bandwidth ratio of roughly 16.

We consider two forms of projection queries: one that computes a linear combination of columns (Q1) and one involving a user-defined function (Q2), as shown below:

Q1: SELECT ax1 + bx2 FROM R;
Q2: SELECT σ(ax1 + bx2) FROM R;

where x1 and x2 are 4-byte floating point values. The number of entries in the input array is 2^29. σ is the sigmoid function (i.e., σ(x) = 1/(1 + e^{−x})), which can represent the output of a logistic regression model. Note that Q2 is more compute-intensive than Q1. We evaluate two CPU implementations: CPU and CPU-Opt. CPU uses a multi-threaded projection where each thread works on a partition of the input data.
CPU-Opt extends
CPU with two extra optimizations: (1) non-temporal writes and (2) SIMD instructions. Non-temporal writes are write instructions that bypass higher cache levels and write out an entire cache line to main memory without first loading it into the caches. SIMD instructions can further improve performance. With a single AVX2 instruction, for example, a modern x86 system can add, subtract, multiply, or divide a group of 8 4-byte floating point numbers, thereby improving the compute power and memory bandwidth utilization. On the GPU side, we implement a single kernel that does two BlockLoads to load the tiles of the respective columns, computes the projection, and does a BlockStore to store it in the result array.
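Both implementations aim to be memory-bandwidth bound. As a back-of-the-envelope check, this illustrative Python helper predicts runtime purely from bytes moved at the device bandwidths (the 54 GBps and 880 GBps figures are the values used in our micro-benchmark setup):

```python
def projection_model_ms(n, read_gbps, write_gbps):
    """Predicted projection runtime in ms: read two 4-byte columns,
    write one 4-byte result column, at the device's bandwidths."""
    read_bytes = 2 * 4 * n      # two input columns of 4-byte floats
    write_bytes = 4 * n         # one output column
    seconds = read_bytes / (read_gbps * 1e9) + write_bytes / (write_gbps * 1e9)
    return seconds * 1e3

# Predicted runtimes for N = 2**29 entries on the CPU (54 GBps)
# and the GPU (880 GBps) configurations used in the micro-benchmarks.
cpu_ms = projection_model_ms(2**29, 54, 54)
gpu_ms = projection_model_ms(2**29, 880, 880)
```

With equal read and write bandwidths, the predicted CPU/GPU runtime ratio is exactly the bandwidth ratio 880/54 ≈ 16.3.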
Model:
Assuming the queries can saturate memory bandwidth, the expected runtime of Q1 and Q2 is

runtime = (2 × 4N) / B_r + (4N) / B_w

where N is the number of entries in the input array, and B_r and B_w are the read and write memory bandwidth, respectively. The first term of the formula models the runtime for loading columns x1 and x2, each containing 4-byte floating point numbers. The second term models the runtime for writing the result column back to memory, which also contains 4-byte floating point numbers. Note that this formula works for both CPU and GPU by plugging in the corresponding memory bandwidth numbers.

Performance Evaluation:
Figure 10 shows the runtime of queries Q1 and Q2 on both CPU and GPU (shown as bars), as well as the predicted runtime based on the model (shown as dashed lines). The performance of Q1 on both CPU and GPU is memory-bandwidth bound. CPU-Opt performs better than CPU due to the increased memory bandwidth efficiency. GPU performs substantially better than both CPU implementations due to its much higher memory bandwidth. The ratio of the runtime of CPU-Opt to GPU is 16.56, which is close to the bandwidth ratio of 16.2. The minor difference is because read bandwidth is slightly lower than write bandwidth on the CPU and the workload has a read:write ratio of 2:1.

A simple multi-threaded implementation of Q2 (i.e., CPU) does not saturate memory bandwidth and is compute bound. After using SIMD instructions (i.e., CPU-Opt), performance improves significantly and the system is close to memory-bandwidth bound. The ratio of the runtime of CPU-Opt to GPU for Q2 is 17.95. This shows that even for fairly complex projections, good implementations on modern CPUs are able to saturate memory bandwidth. GPUs do significantly better than CPUs due to their high memory bandwidth, with the performance gain equal to the bandwidth ratio.

Figure 10: Project microbenchmark

Figure 11: Implementing selection scan
(a) With branching:
    for each y in R:
        if y > v:
            output[i++] = y
(b) With predication:
    for each y in R:
        output[i] = y
        i += (y > v)
We now turn our attention to evaluating selections, also called selection scans. Selection scans have re-emerged for main-memory query execution and are replacing traditional unclustered indexes in modern OLAP DBMSs [33]. We use the following micro-benchmark to evaluate selections:
Q3 : SELECT y FROM R WHERE y < v; where y and v are both 4-byte floating point values. The sizeof input array is 2 . We vary the selectivity of the predicatefrom 0 to 1 in steps of 0.1.To evaluate the above query on a multi-core CPU, we usethe CPU implementation described earlier in Section 3.2.We evaluate three variants. The “naive” branching imple-mentation ( CPU If ) implements the selection using anif-statement, as shown in Figure 15(a). The main problemwith the branching implementation is the penalty for branchmispredictions. If the selectivity of the condition is neithertoo high nor too low, the CPU branch predictor is unableto predict the branch outcome. This leads to pipeline stallsthat hinder performance. Previous work has shown thatthe branch misprediction penalty can be avoided by usingbranch-free predication technique [34]. Figure 15(b) illustratesthe predication approach. Predication transforms the branch(control dependency) into a data dependency.
CPU Pred implements the selection scan with predication. More recently, vectorized selection scans have been shown to improve on CPU Pred by using selective stores to buffer entries that satisfy selection predicates and writing out entries using streaming stores [31].
CPU SIMDPred implements this approach.
Figure 12: Select Microbenchmark
On the GPU, the query is implemented as a single kernel as described in Section 3.2 and as shown in Figure 4(b). We implement two variants: GPU If implements the selection using an if-statement, and GPU Pred implements it using predication.
Model:
The entire input array is read and only the matched entries are written to the output array. Assuming the implementations can write out the matched entries efficiently and saturate memory bandwidth, the expected runtime is:

runtime = (4N) / B_r + (4 × σ × N) / B_w

where N is the number of entries in the input array, B_r and B_w are the read and write bandwidth of the respective device, and σ is the predicate selectivity.

Performance Evaluation:
Figure 12 shows the runtime of the three algorithms on CPU, two algorithms on GPU, and the performance models. CPU Pred does better than CPU If at all selectivities except 0 (at 0, CPU If does no writes). Across the range, CPU SIMDPred does better than the two scalar implementations. On the GPU, there is no performance difference between GPU Pred and GPU If: a single branch misprediction does not impact performance on the GPU. Both CPU SIMDPred and GPU If/Pred closely track their respective theoretical models, which assume saturation of memory bandwidth. The average runtime ratio of CPU-to-GPU is about 15. This shows that with efficient implementations, CPU implementations saturate memory bandwidth for selections, and the gain of GPU over CPU is equal to the bandwidth ratio.
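The selection model above can be instantiated directly. This illustrative Python helper uses the micro-benchmark bandwidth figures:

```python
def selection_model_ms(n, sigma, read_gbps, write_gbps):
    """Predicted selection-scan runtime in ms: read one 4-byte column,
    write back only the sigma-fraction of matching 4-byte entries."""
    seconds = 4 * n / (read_gbps * 1e9) + 4 * sigma * n / (write_gbps * 1e9)
    return seconds * 1e3

# At sigma = 0 only the read term remains; at sigma = 1 the scan
# reads and writes the full column (CPU figures: 54 GBps).
lo = selection_model_ms(2**29, 0.0, 54, 54)
hi = selection_model_ms(2**29, 1.0, 54, 54)
```

With equal read and write bandwidths, the model predicts the runtime at selectivity 1 to be exactly twice that at selectivity 0, matching the linear trend in Figure 12.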
Hash join is the most popular algorithm used for executing joins in a database. Hash joins have been extensively studied in the database literature, with many different hash join algorithms proposed for both CPUs and GPUs [9–11, 14, 18, 23]. The most commonly used hash join algorithm is the no-partitioning join, which uses a non-partitioned global hash table. The algorithm consists of two phases: in the build phase, the tuples in one relation (typically the smaller relation) are used to populate the hash table in parallel; in the probe phase, the tuples in the other relation are used to probe the hash table for matches in parallel. For our microbenchmark, we focus on the following join query:
Q4: SELECT SUM(A.v + B.v) AS checksum FROM A, B WHERE A.k = B.k

where tables A and B each consist of two 4-byte integer columns k, v. The two tables are joined on key k. We keep the size of the probe table fixed at 256 million tuples, totaling 2GB of raw data. We use a hash table with a 50% fill rate. We vary the size of the build table such that it produces a hash table of the desired size in the experiment. We vary the size of the hash table from 8KB to 1GB. The microbenchmark is the same as what past works use [9–11, 37].

In this section, we mainly focus on the probe phase, which forms the majority of the total runtime. We briefly discuss the difference in execution with respect to build time at the end of the section. There are many hash table variants; in this section we focus on linear probing due to its simplicity and regular memory access pattern. Our conclusions, however, should apply equally well to other probing approaches. Linear probing is an open addressing scheme that, to either insert an entry or terminate the search, traverses the table linearly until an empty bucket is found. The hash table is simply an array of slots, with each slot containing a key and a payload but no pointers.

On the CPU side, we implemented three variants of linear probing. (1) CPU Scalar implements a scalar tuple-at-a-time join. The probe table is partitioned equally among the threads. Each thread iterates over its entries and for each entry probes the hash table to find a matching entry. On finding a match, it adds A.v + B.v to its local sum. At the end, we add the local sum to the global sum using atomic instructions. (2) CPU SIMD implements vertical vectorized probing in a hash table [31]. The key idea in vertical vectorization is to process a different key per SIMD lane and use gathers to access the hash table. Assuming W vector lanes, we process W different input keys on each loop iteration.
In every round, for the set of keys that have found their matches, we calculate their sum, add it to a local sum, and reload those SIMD lanes with new keys. (3) Finally, CPU Prefetch adds group prefetching to CPU Scalar [14]. For each loop iteration, software prefetching instructions are inserted to load the hash table entry that will be accessed a certain number of loop iterations ahead. The goal is to better hide memory latency at the cost of an increased number of instructions.

On the GPU side, we implemented the join as follows. We load in a tile of keys and payloads from the probe side using BlockLoad; the threads iterate over each tile independently to find matching entries from the hash table. Each thread maintains a local sum of entries processed. After processing all entries in a tile, we use BlockAgg to aggregate the local sums within a thread block into a single value and increment a global sum with it.
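A minimal sketch of linear probing for Q4 (illustrative Python, not the paper's CUDA or SIMD code): the hash table is a flat array of (key, value) slots, and a probe walks forward from the hashed slot until it finds the key or an empty slot. An identity hash stands in for whatever hash function an implementation would use.

```python
EMPTY = None

def build(table_size, pairs):
    """Insert (key, value) pairs into a flat open-addressing table
    using linear probing (identity hash, for illustration)."""
    slots = [EMPTY] * table_size
    for k, v in pairs:
        i = k % table_size
        while slots[i] is not EMPTY:       # walk to the next free slot
            i = (i + 1) % table_size
        slots[i] = (k, v)
    return slots

def probe_sum(slots, probe_pairs):
    """For each probe tuple (k, v), look up k; on a match add
    build_value + probe_value to the running checksum (as in Q4)."""
    table_size = len(slots)
    checksum = 0
    for k, v in probe_pairs:
        i = k % table_size
        while slots[i] is not EMPTY:
            if slots[i][0] == k:
                checksum += slots[i][1] + v
                break
            i = (i + 1) % table_size       # an empty slot ends the search
    return checksum
```

Each probe touches one cache line per slot visited, which is why the model below distinguishes cache-resident from memory-resident hash tables.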
Model:
The probe phase involves making random accesses to the hash table to find the matching tuple from the build side. Every random access to memory ends up reading an entire cache line. However, if the hash table is small enough to be cached, then random accesses no longer hit main memory and performance improves significantly. We model the runtime as follows:

1) If the hash table size is smaller than the size of the level-K cache, we expect the runtime to be:

runtime = max( (2 × 4 × |P|) / B_r , (1 − π_{K−1}) × (|P| × C) / B_K )

where |P| is the cardinality of the probe table, B_r is the read bandwidth from device memory, C is the cache line size accessed on probe, B_K is the bandwidth of the level-K cache in which the hash table fits, and π_{K−1} is the probability of an access hitting a level-(K−1) cache. If the size of the level-K cache is S_K and the size of the hash table is H, we define the cache hit ratio π_K = min(S_K / H, 1). The total runtime will be bounded by either the device memory bandwidth or the cache bandwidth; hence the runtime is the maximum of the two terms.

2) If the hash table size is larger than the size of the last-level cache, we expect the runtime to be:

runtime = (2 × 4 × |P|) / B_r + (1 − π) × (|P| × C) / B_r

where π is the probability that the accessed cache line is in the last-level cache.

Performance Evaluation:
Figure 13 shows the performance evaluation of the different implementations of the join. Both CPU and GPU variants exhibit a step increase in runtime when the hash table size exceeds the cache size of a particular level. On the CPU, the step increases happen when the hash table size exceeds 256KB (L2 cache size) and 20MB (L3 cache size). On the GPU, the step increase happens when the hash table size exceeds 6MB (L2 cache size).

We see that CPU SIMD performs worse than CPU Scalar, even when the hash table is cache-resident.
CPU SIMD uses AVX2 instructions with 256-bit registers, which represent 8 lanes of 32-bit integers. With 8 lanes, we process 8 keys at a time. However, a single SIMD gather used to fetch matching entries from the hash table can only fetch 4 entries at a time (as each hash table lookup returns an 8-byte slot, i.e., a 4-byte key and a 4-byte value, with 4 lookups filling the entire register). As a result, for each set of 8 keys, we do 2 SIMD gathers and then de-interleave the columns into 8 keys and 8 values. This added overhead of extra instructions does not exist in the scalar version.

Figure 13: Join Performance.

CPU SIMD is also brittle and not easy to extend to cases where the hash table slot size is larger than 8 bytes. Note that past work has evaluated vertical vectorization with key-only build relations, which do not exhibit this issue [27, 31]. Comparing
CPU Prefetch to CPU Scalar shows that there is limited improvement from prefetching when the data size is larger than the L3 cache size. When the hash table fits in cache, prefetching degrades performance due to the added overhead of the prefetching instructions.

Due to the step-change nature of the performance curves, the ratio of the runtimes varies based on the hash table size. When the hash table size is between 32KB and 128KB, the hash table fits in L2 on both CPU and GPU. In this segment, we observe that the runtime is bound by DRAM memory bandwidth on the CPU and L2 cache bandwidth on the GPU. The average gain is roughly 5×, which is in line with the model. When the hash table size is between 1MB and 4MB, the hash table fits in the L2 on the GPU and in the L3 cache on the CPU. The ratio of runtimes in this segment is roughly 14×, which is the ratio of the L2 cache bandwidth on the GPU to the L3 cache bandwidth on the CPU. Finally, when the hash table size is larger than 128MB, the hash table does not fit in cache on either GPU or CPU. The granularity of reads from global memory is 128B on the GPU while on the CPU it is 64B; hence, random accesses into the hash table read twice the data on GPU compared to CPU. Given the bandwidth ratio of 16.2, we would therefore expect a gain of roughly half of that; the observed gain is higher because the CPU additionally loses performance to memory stalls. The fact that the actual CPU results are slower than CPU Model is because the model assumes maximum main-memory bandwidth, which is not achievable as the hash table causes random memory access patterns.
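The two cases of the probe model above can be instantiated numerically. The following illustrative Python helpers encode them; the bandwidth and cache-line values in the example are the CPU figures from our setup, used as stand-ins:

```python
def probe_time_cached_s(p, b_r, c, b_k, pi_prev):
    """Case 1: hash table fits in the level-K cache. Runtime is the max of
    the device-memory term (reading key + value, 8 bytes per probe tuple)
    and the cache term (misses into level K each read one line of c bytes)."""
    mem_term = 2 * 4 * p / b_r
    cache_term = (1 - pi_prev) * p * c / b_k
    return max(mem_term, cache_term)

def probe_time_uncached_s(p, b_r, c, pi_llc):
    """Case 2: hash table exceeds the last-level cache. Misses now go to
    device memory, so the two terms add instead of overlapping."""
    return 2 * 4 * p / b_r + (1 - pi_llc) * p * c / b_r

def hit_ratio(cache_size, table_size):
    """pi_K = min(S_K / H, 1)."""
    return min(cache_size / table_size, 1.0)
```

For a 256-million-tuple probe table and a hash table far larger than the 20MB L3, the miss term dominates, reproducing the rightmost step in Figure 13.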
Discussion:
The runtime of the build phase in the microbenchmark shows a linear increase with the size of the build relation. The build-phase runtimes are less affected by caches, as writes to the hash table end up going to memory.

In this section, we modeled and evaluated the no-partitioning join. Another variant of hash join is the partitioned hash join. Partitioned hash joins use a partitioning routine like radix partitioning to partition the input relations into cache-sized chunks and, in a second step, run the join on the corresponding partitions. Efficient radix-based hash join algorithms (radix join) have been proposed for CPUs [9–11, 14] and for GPUs [35, 38]. Radix join requires the entire input to be available before the join starts, and as a result intermediate join results cannot be pipelined. Hence, while radix join is faster for a single join, radix joins are not used for queries with multiple joins. While we do not explicitly model or evaluate radix joins, in the next section we discuss the radix partitioning routine that is the key component of such joins. That discussion shows that careful radix partition implementations on both GPU and CPU are memory bandwidth bound, and hence the performance difference is roughly equal to the bandwidth ratio.
In this section, we evaluate the performance of sorting 32-bit key and 32-bit value arrays based on the key. According to the literature, the fastest sort algorithm for this workload is radix sort. We start by describing Least-Significant-Bit (LSB) radix sort on the CPU [32] and on the GPU [28]. LSB radix sort is the fastest for this workload on the CPU. We describe why LSB radix sort does comparatively poorly on the GPU and why an alternate version called Most-Significant-Bit (MSB) radix sort does better on the GPU [42]. We present a model for the runtime of the radix sort algorithm and then analyze the performance characteristics of radix partitioning on CPU vs. GPU. Our implementations are primarily based on previous work, but this is the first time these algorithms have been compared to each other.

LSB radix sort internally comprises a sequence of radix partition passes. Given an array A, radix r, and start bit a, a radix partition pass partitions the elements of the input array A into a contiguous array of 2^r output partitions based on the value of the r bits e[a : a + r) (i.e., the radix) of the key e. On both the CPU and GPU, radix partitioning involves two phases. In the first phase (histogram phase), each thread (CPU) / thread block (GPU) computes a histogram over its entries to find the number of entries in each of the 2^r partitions. In the second phase (data shuffling phase), each thread (CPU) / thread block (GPU) maintains an array of pointers initialized by the prefix sum over the histogram and writes entries to the right partition based on these offsets. The entire sorting algorithm contains multiple radix partition passes, with each pass looking at a disjoint set of bits of the key e, starting from the lowest bits e[0 : r) to the highest bits e[k − r : k) (where k is the bit length of the key).

On the CPU, we use the implementation of Polychroniou et al. [32].
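A single radix partition pass, as just described, can be sketched in illustrative Python; the real CPU and GPU implementations are heavily optimized and parallel, and this sketch only pins down the histogram and shuffle phases and the pass structure of LSB radix sort:

```python
def radix_partition_pass(keys, payloads, start_bit, r):
    """One stable radix partition pass on bits [start_bit, start_bit + r)."""
    num_parts = 1 << r
    mask = num_parts - 1
    # Histogram phase: count entries per partition.
    hist = [0] * num_parts
    for k in keys:
        hist[(k >> start_bit) & mask] += 1
    # Prefix sum over the histogram gives each partition's start offset.
    offsets, total = [0] * num_parts, 0
    for p in range(num_parts):
        offsets[p] = total
        total += hist[p]
    # Shuffle phase: scatter each entry to its partition's next free slot.
    out_keys = [0] * len(keys)
    out_vals = [0] * len(keys)
    for k, v in zip(keys, payloads):
        p = (k >> start_bit) & mask
        out_keys[offsets[p]] = k
        out_vals[offsets[p]] = v
        offsets[p] += 1
    return out_keys, out_vals

def lsb_radix_sort(keys, payloads, key_bits=32, r=8):
    """LSB radix sort: repeated stable passes from low bits to high bits."""
    for start in range(0, key_bits, r):
        keys, payloads = radix_partition_pass(keys, payloads, start, r)
    return keys, payloads
```

Because each pass is stable, sorting low bits first and high bits last yields a fully sorted key/payload array; the stability requirement is exactly what limits the per-pass radix width on the GPU, as discussed below.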
In the histogram phase, each thread makes one pass over its partition, for each entry calculating its radix value and incrementing the count in its histogram (stored in the L1 cache). For the shuffle phase, we first compute a prefix sum over the histograms of all the threads (a 2D array of dimension 2^r × t, where t is the number of threads) to find the partition offsets for each of the threads. Next, each thread makes a pass over its partition, using gathers and scatters to increment the counts in its offset array and writing to the right offsets in the output array. The implementation makes a number of optimizations to achieve good performance; the interested reader can refer to [32] for more details.

On the GPU, we implemented LSB radix sort based on the work of Merrill et al. [28]. In the histogram phase, each thread block loads a tile, computes a histogram that counts the number of entries in each partition, and writes it out to global memory. A prefix sum is used to find the partition offsets for each thread block in the output array. Next, in the shuffling phase, each thread block reads in its offset array. The radix partitioning pass described above needs to do stable partitioning, i.e., ensure that for two entries with the same radix, the one occurring earlier in the input array also occurs earlier in the output array. On the GPU, in order to ensure stable partitioning for LSB radix sort, we need to internally generate an offsets array for each thread from the thread block offset array. For an r-bit radix partitioning, we need a 2^r-sized histogram per thread. A number of optimizations have been proposed to store the histogram efficiently in registers, details of which are described in [28]. Due to the restriction on the number of registers available per thread, a stable radix partitioning pass can only process 7 bits at a time.

Recently, Stehle et al. [42] presented an MSB radix sorting algorithm for the GPU. MSB radix sort does not require stable partitioning.
As a result, in the shuffle phase, we can maintain just a single offset array of size 2^r for the entire thread block. This allows MSB radix sort to process up to 8 bits at a time. Hence, MSB radix sort can sort an array of 32-bit keys with 4 passes, each processing 8 bits at a time. On the other hand, LSB radix sort on the GPU can process only 7 bits at a time, and hence needs 5 radix partitioning passes processing 6, 6, 6, 7, and 7 bits each.

Model:
In the histogram phase, we read in the key column and write out a tiny histogram. The expected runtime is:

runtime_histogram = (4 × R) / B_r

where R is the size of the input array and B_r is the read bandwidth. In the shuffle phase, we read both the key and payload columns and at the end write out the radix-partitioned key and payload columns. If this step is memory bandwidth bound, the runtime is expected to be:

runtime_shuffle = (2 × 4 × R) / B_r + (2 × 4 × R) / B_w

where B_w is the write bandwidth.

Performance Evaluation:
We evaluate the performance of the histogram and shuffle phases of the three variants:
Figure 14: Sort Microbenchmark. (a) Radix histogram on CPU and GPU; (b) radix shuffling on CPU and GPU.

Figure 15: Implementing selection scan. (a) With branching; (b) with predication.

CPU Stable (stable partitioning on CPU),
GPU Stable (stable partitioning on GPU), and GPU Unstable (unstable partitioning on GPU). We set the size of the input arrays at 256 million entries and vary the number of radix bits we partition on. Figure 14a shows the results for the histogram phase. Note that in the histogram phase there is no difference between GPU Stable and GPU Unstable. The histogram pass is memory bandwidth bound on both the CPU and GPU. Figure 14b shows the results for the shuffle phase. GPU Stable is able to partition up to 7 bits at a time, whereas GPU Unstable is able to partition 8 bits at a time. CPU Stable is able to partition up to 8 bits at a time while remaining bandwidth bound. Beyond 8 bits, the size of the partition buffers needed exceeds the size of the L1 cache and performance starts to deteriorate.

Now that we have the radix partitioning passes, we look at the sort runtime. On the CPU, we use stable partitioning to implement LSB radix sort. It ends up running 4 radix partitioning passes, each looking at 8 bits at a time. On the GPU, MSB radix sort also sorts the data with 4 passes, each processing 8 bits at a time. The time taken to sort 2^28 entries is 464 ms on the CPU and roughly 27 ms on the GPU. The runtime gain is roughly 17×, which is close to the bandwidth ratio of 16.2×.

WORKLOAD EVALUATION
Now that we have a good understanding of how individual operators behave on both CPU and GPU, we evaluate the performance of a workload of full SQL queries on both hardware platforms. We first describe the query workload we use in our evaluation. We then present a high-level comparison of the performance of queries running on the GPU implemented with the tile-based execution model versus our own equivalent implementation of the queries on the CPU. We also report the performance of Hyper [29] on the CPU and Omnisci [6] on the GPU, which are both state-of-the-art implementations. As a case study, we provide a detailed performance breakdown of one of the queries to explain the performance gains. Finally, we present a dollar-cost comparison of running queries on CPU and GPU.

We use two platforms for our evaluation. For experiments run on the CPU, we use a machine with a single-socket Skylake-class Intel i7-6900 CPU with 8 cores that supports AVX2 256-bit SIMD instructions. For experiments run on the GPU, we use an instance which contains an Nvidia V100 GPU. We measured the bidirectional PCIe transfer bandwidth to be 12.8GBps. More details of the two instances are shown in Table 2. Each system runs Ubuntu 16.04, and the GPU instance has CUDA 10.0. In our evaluation, we ensure that data is already loaded into the respective device's memory before experiments start. We run each experiment 3 times and report the average measured execution time.
For the full query evaluation, we use the Star Schema Benchmark (SSB) [30], which has been widely used in various data analytics research studies [15, 24, 44, 48]. SSB is a simplified version of the more popular TPC-H benchmark. It has one fact table lineorder and four dimension tables date, supplier, customer, and part, which are organized in a star schema fashion. There are a total of 13 queries in the benchmark, divided into 4 query flights. In our experiments we run the benchmark with a scale factor of 20, which generates the fact table with 120 million tuples. The total dataset size is around 13GB.
In this section, we compare the query runtimes of benchmark queries implemented using block-wide functions on the GPU (Standalone GPU) to an equivalent efficient implementation of the queries on the CPU (Standalone CPU). We also compare against Hyper (Hyper), a state-of-the-art OLAP DBMS, and Omnisci (
Omnisci), a commercial GPU-based OLAP DBMS. In order to ensure a fair comparison across systems, we dictionary-encode the string columns into integers prior to data loading and manually rewrite the queries to directly reference the dictionary-encoded values. For example, a query with predicate s_region = 'ASIA' is rewritten with the predicate s_region = 2, where 2 is the dictionary-encoded value of 'ASIA'. Some columns have a small number of distinct values and could be represented/encoded with 1–2 byte values. However, in our benchmark we make sure all column entries are 4-byte values to ensure ease of comparison with other systems and avoid implementation artifacts. Our goal is to understand the nature of the performance gains of equivalent implementations on GPU and CPU, not to achieve the best storage layout. We store the data in columnar format, with each column represented as an array of 4-byte values. On the GPU, we use a thread block size of 256 with a tile size of 2048 (= 8 × 256).

Platform           CPU                  GPU
Model              Intel i7-6900        Nvidia V100
Cores              8 (16 with SMT)      5000
Memory Capacity    64 GB                32 GB
L1 Size            32KB/Core            16KB/SM
L2 Size            256KB/Core           6MB (Total)
L3 Size            20MB (Total)         -
Read Bandwidth     53GBps               880GBps
Write Bandwidth    55GBps               880GBps
L1 Bandwidth       -                    10.7TBps
L2 Bandwidth       -                    2.2TBps
L3 Bandwidth       157GBps              -

Table 2: Hardware Specifications

Comparing Standalone CPU to Hyper shows that the former performs better on average, as Hyper misses vectorization opportunities and uses a different implementation of hash tables. The comparison shows that our implementation is fair and quite competitive with a state-of-the-art OLAP DBMS. We also compared against MonetDB [12], a popular baseline for many past works on GPU-based databases. We found that Standalone CPU is on average more than 2× faster than MonetDB; we did not include it in the figure as it made the graph hard to read. We also tried to compare against Peloton with relaxed-operator fusion [27]. We found that the system could not load the scale factor 20 dataset; scaling down to scale factor 10, its queries were significantly slower than Hyper or our approach.

Comparing Standalone GPU to Omnisci, we see that our GPU implementation does significantly better than Omnisci, with an average improvement of around 16×. Both methods run with the entire working set stored on the GPU. Omnisci treats each GPU thread as an independent unit. As a result, it does not realize the benefits of blocked loading and the better GPU utilization obtained from using the tile-based model. The comparison of
Standalone GPU against
Omnisci and
Standalone CPU to Hyper serve as a sanity check and show that our query implementations are quite competitive. Comparing
Standalone GPU to Standalone CPU , we seethat the
Standalone GPU is on average 25× faster than the CPU implementation. This is higher than the bandwidth ratio
Figure 16: Star Schema Benchmark Queries
SELECT SUM(lo_revenue) AS revenue, d_year, p_brand
FROM lineorder, date, part, supplier
WHERE lo_orderdate = d_datekey
  AND lo_partkey = p_partkey
  AND lo_suppkey = s_suppkey
  AND p_category = 'MFGR#12'
  AND s_region = 'AMERICA'
GROUP BY d_year, p_brand
Figure 17: SSB Query 2.1 of 16 .
2. This is surprising given that in Section 4 we saw thatindividual query operators had a performance gain equal toor lower than the bandwidth ratio. The key reason for theperformance gain being higher than the bandwidth ratio isthe better latency hiding capability of GPUs. To get a bettersense for the runtime difference, in the next subsection wediscuss models for the full SQL queries and dive into whyarchitecture differences leads to significant difference inperformance gain from the bandwidth ratio.
The queries in the Star Schema Benchmark can be broken into two sets: 1) query flight q1.x, which consists of queries with selections directly on the fact table and no joins, and 2) query flights q2.x, q3.x, and q4.x, which consist of queries with no selections on the fact table and multiple joins, some of which are selective. In this section, we analyze the behavior of q2.1 in detail as a case study. Specifically, we build a model assuming the query is memory-bandwidth bound, derive the expected runtime based on the model, compare it against the observed runtime, and explain the differences observed. Figure 17 shows the query: it joins the fact table lineorder with 3 dimension tables: supplier, part, and date. The selectivities of the predicates on p_category and s_region are 1/25 and 1/5 respectively; the joins with part and supplier therefore have the same selectivities. We choose a query plan where lineorder first joins supplier, then part, and finally date; this plan delivers the highest performance among the several promising plans that we evaluated. The cardinalities of the tables lineorder, supplier, part, and date are 120M, 40k, 1M, and 2.5k respectively. The query runs a build phase for each of the 3 joins to build their respective hash tables. Then a final probe phase runs the joins pipelined. Given the small size of the dimension tables, the build time is much smaller than the probe time, hence we focus on modeling the probe time. On the GPU, each thread block processes a partition of the fact table, doing each of the 3 joins sequentially and updating a global hash table at the end that maintains the aggregate. Past work [26] has shown that the L2 cache on the GPU is an LRU set-associative cache. Since the hash tables associated with the supplier and date tables are small, we can assume that they remain in the L2 cache. The size of the part hash table is larger than the L2 cache. We model the runtime as consisting of 3 components:

1) The time taken to access the columns of the fact table:

r1 = (|L|/C + min(|L|/C, |L|σ1) + min(|L|/C, |L|σ1σ2) + min(|L|/C, |L|σ1σ2)) × C/Br

where σ1 and σ2 are the join selectivities associated with the joins with the supplier and part tables respectively, |L| is the cardinality of the lineorder table, C is the size of a cache line, and Br is the global memory read bandwidth. For each column except the first, the number of cache lines accessed is the minimum of: 1) accessing all cache lines of the column (|L|/C) and 2) accessing a cache line per entry read (|L|σ).

2) The time taken to probe the join hash tables:

r2 = (2×|S| + 2×|D| + (1−π)(|L|σ1)) × C/Br

where |S| and |D| are the cardinalities of the supplier and date tables, (|L|σ1) represents the number of lookups into the part hash table, and π is the probability of finding a part hash table lookup in the L2 cache.

3) The time taken to read and write to the result table:

r3 = |L|σ1σ2 × C/Br + |L|σ1σ2 × C/Bw

The total runtime on the GPU is r1 + r2 + r3. The key difference with respect to the CPU is that on the CPU, all three hash tables fit in the L3 cache. Hence for the CPU we would have r2 = (2×|S| + 2×|D| + 2×|P|) × C/Br. To calculate π, we observe that the size of the part hash table (with perfect hashing) is 2 × 4 × 1M = 8 MB. With the supplier and date hash tables in cache, the available cache space is roughly 5.6 MB, so the probability of finding a part lookup in the L2 cache is π = 5.6/8. Plugging in the values, we get expected runtimes on the CPU and GPU of 47 ms and 3.7 ms respectively, compared to actual runtimes of 125 ms and 3.86 ms. We see that the model-predicted runtime on the GPU is close to the actual runtime, whereas on the CPU the actual runtime is higher than the modeled runtime. This is in large part because of the ability of GPUs to hide memory latency even with irregular accesses. SIMT GPUs run scalar code, but they "tie" all the threads in a warp to execute the same instruction in a cycle. For instance, gathers and scatters are written as scalar loads and stores to non-contiguous locations. In a way, CPU threads are similar to GPU warps and GPU threads are similar to SIMD lanes. A key difference between the SIMT model on the GPU and the SIMD model on the CPU is what happens on a memory access. On the CPU, if a thread makes a memory access, the thread waits for the memory fetch to return. If the cache line being fetched is not in cache, it leads to a memory stall. CPUs have prefetchers to remedy this, but prefetchers do not work well with irregular access patterns like join probes. On the GPU, a single streaming multiprocessor (SM) usually has 64 cores that can execute 2 warps (64 threads) at any point. However, the SM can keep many more warps resident and switch among them to hide memory latency.

The paper has so far demonstrated that GPUs can have superior performance to CPUs for data analytics. However, GPUs are known to be more expensive than CPUs in terms of cost. Table 3 shows both the purchase and renting cost of a CPU and a GPU that match the hardware used in this paper (i.e., Table 2). For renting costs, we use the cost of EC2 instances provided by Amazon Web Services (AWS). For the CPU, we choose the instance type r5.2xlarge, which contains a modern Skylake CPU with 8 cores, with a cost of $0.
504 per hour. For the GPU, we choose the instance type p3.2xlarge, whose specs are similar to r5.2xlarge except that it also has an Nvidia V100 GPU, with a cost of $3.06 per hour. The cost ratio of the two systems is about 6×. For purchase costs, we compare the estimate of a single-socket server blade to the same server blade with one Nvidia V100 GPU. The cost ratio of the two systems at the high end is less than 6×. The average performance gap, however, is about 25× according to our evaluation (cf. Section 5.2), which leads to a factor of 4 improvement in the cost effectiveness of the GPU over the CPU. Although performance and cost will vary a lot across different CPU and GPU technologies, the ratio between the two will not change as much. Therefore, we believe the analysis above should largely apply to other hardware selections.

        Purchase Cost    Renting Cost
CPU     $2-5K            $0.504 per hour
GPU     $CPU + $8.5K     $3.06 per hour

Table 3: Purchase and renting cost of CPU and GPU.

In this paper, we showed through our model-based analysis and empirical evaluation that there is limited gain from using GPUs as a coprocessor and that the runtime gain from running queries on the GPU vs the CPU is about 1.5× the bandwidth ratio of the two devices. We believe that these results should help pivot the community towards treating GPUs as a primary execution engine. However, this paper largely focused on using a single GPU, which has limited memory capacity. There are many challenges that need to be addressed before GPUs see widespread adoption; these were beyond the scope of this paper and make for exciting future work:

• Distributed+Hybrid
It is possible to attach multiple GPUs to a single machine, greatly increasing the aggregate HBM memory capacity. These machines will also have significant CPU memory. Executing queries on such a heterogeneous system is still an open problem.

• Compression
Data compression could be used to fit more data into the GPU's memory. GPUs have a higher compute-to-bandwidth ratio than CPUs, which could allow the use of non-byte-addressable packing schemes.

• Strings/Non-Scalar Data Types
Handling arbitrary strings and array data types efficiently on GPUs is still an open problem.
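The non-byte-addressable packing alluded to in the Compression bullet can be sketched as follows (an illustrative example, not code from Crystal): dictionary codes with few distinct values are stored at sub-byte granularity and recovered with shifts and masks, trading the GPU's spare compute for its scarcer bandwidth.

```python
# Minimal bit-packing sketch: store k-bit dictionary codes back to back
# instead of one code per 4-byte word. A GPU kernel would apply the
# same shift/mask logic per thread when decoding a packed column.

def pack(codes, k):
    """Pack each k-bit code into one big integer, least significant first."""
    word = 0
    for i, c in enumerate(codes):
        assert 0 <= c < (1 << k), "code does not fit in k bits"
        word |= c << (i * k)
    return word

def unpack(word, k, n):
    """Recover n k-bit codes with shifts and masks."""
    mask = (1 << k) - 1
    return [(word >> (i * k)) & mask for i in range(n)]

# e.g. s_region has 5 distinct values, so 3 bits suffice instead of 32
codes = [2, 0, 3, 1, 4]
packed = pack(codes, 3)
assert unpack(packed, 3, len(codes)) == codes
```

The round trip above compresses the column by more than 10×; the open question the bullet raises is making such decoding cheap inside full query pipelines.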
This paper compared CPUs and GPUs on database analytics workloads. We demonstrated that running an entire SQL query on a GPU delivers better performance than using the GPU as an accelerator. To ease the implementation of high-performance SQL queries on GPUs, we developed Crystal, a library supporting a tile-based execution model. Our analysis on SSB, a popular analytics benchmark, shows that modern GPUs are 25× faster and 4× more cost effective than CPUs. This makes a strong case for using GPUs as the primary execution engine when the dataset fits into GPU memory.

REFERENCES
Proceedings of the VLDB Endowment, 7(1):85-96, 2013.
[10] C. Balkesen, J. Teubner, G. Alonso, and M. T. Özsu. Main-memory hash joins on multi-core CPUs: Tuning to the underlying hardware. In Data Engineering (ICDE), 2013 IEEE 29th International Conference on, pages 362-373. IEEE, 2013.
[11] S. Blanas, Y. Li, and J. M. Patel. Design and evaluation of main memory hash join algorithms for multi-core CPUs. In Proceedings of the 2011 ACM SIGMOD International Conference on Management of Data, pages 37-48. ACM, 2011.
[12] P. A. Boncz et al. Monet: A next-generation DBMS kernel for query-intensive applications. Universiteit van Amsterdam [Host], 2002.
[13] P. A. Boncz, M. Zukowski, and N. Nes. MonetDB/X100: Hyper-pipelining query execution. In CIDR, volume 5, pages 225-237, 2005.
[14] S. Chen, A. Ailamaki, P. B. Gibbons, and T. C. Mowry. Improving hash join performance through prefetching. ACM Transactions on Database Systems (TODS), 32(3):17, 2007.
[15] H. Funke, S. Breß, S. Noll, V. Markl, and J. Teubner. Pipelined query processing in coprocessor environments. In Proceedings of the 2018 International Conference on Management of Data, pages 1603-1618. ACM, 2018.
[16] N. Govindaraju et al. GPUTeraSort: High performance graphics co-processor sorting for large database management. In SIGMOD, 2006.
[17] M. Harris, S. Sengupta, and J. D. Owens. Parallel prefix sum (scan) with CUDA. GPU Gems, 3(39):851-876, 2007.
[18] B. He, K. Yang, R. Fang, M. Lu, N. Govindaraju, Q. Luo, and P. Sander. Relational joins on graphics processors. In Proceedings of the 2008 ACM SIGMOD International Conference on Management of Data, pages 511-524, 2008.
[19] J. He, M. Lu, and B. He. Revisiting co-processing for hash joins on the coupled CPU-GPU architecture. PVLDB, 2013.
[20] M. Heimel, M. Saecker, H. Pirk, S. Manegold, and V. Markl. Hardware-oblivious parallelism for in-memory column-stores. PVLDB, 2013.
[21] J. Holewinski, L.-N. Pouchet, and P. Sadayappan. High-performance code generation for stencil computations on GPU architectures. In Proceedings of the 26th ACM International Conference on Supercomputing, pages 311-320. ACM, 2012.
[22] T. Kaldewey, G. Lohman, R. Mueller, and P. Volk. GPU join processing revisited. In DaMoN, 2012.
[23] H. Lang, V. Leis, M.-C. Albutiu, T. Neumann, and A. Kemper. Massively parallel NUMA-aware hash joins. In In Memory Data Management and Analysis, pages 3-14. Springer, 2015.
[24] J. Li, H.-W. Tseng, C. Lin, Y. Papakonstantinou, and S. Swanson. HippogriffDB: Balancing I/O and GPU bandwidth in big data analytics. Proceedings of the VLDB Endowment, 9(14):1647-1658, 2016.
[25] S. Manegold, P. A. Boncz, and M. L. Kersten. Optimizing database architecture for the new bottleneck: memory access. The VLDB Journal, 9(3):231-246, 2000.
[26] X. Mei and X. Chu. Dissecting GPU memory hierarchy through microbenchmarking. IEEE Transactions on Parallel and Distributed Systems, 2016.
[27] P. Menon, T. C. Mowry, and A. Pavlo. Relaxed operator fusion for in-memory databases: Making compilation, vectorization, and prefetching work together at last. Proceedings of the VLDB Endowment, 11(1):1-13, 2017.
[28] D. Merrill and A. Grimshaw. High performance and scalable radix sorting: A case study of implementing dynamic parallelism for GPU computing. Parallel Processing Letters, 21(02):245-272, 2011.
[29] T. Neumann. Efficiently compiling efficient query plans for modern hardware. Proceedings of the VLDB Endowment, 4(9):539-550, 2011.
[30] P. O'Neil, E. O'Neil, X. Chen, and S. Revilak. The star schema benchmark and augmented fact table indexing. In Technology Conference on Performance Evaluation and Benchmarking, pages 237-252. Springer, 2009.
[31] O. Polychroniou, A. Raghavan, and K. A. Ross. Rethinking SIMD vectorization for in-memory databases. In Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data, pages 1493-1508. ACM, 2015.
[32] O. Polychroniou and K. A. Ross. A comprehensive study of main-memory partitioning and its application to large-scale comparison- and radix-sort. In Proceedings of the 2014 ACM SIGMOD International Conference on Management of Data. ACM, 2014.
[33] V. Raman, G. Attaluri, R. Barber, N. Chainani, D. Kalmuk, V. KulandaiSamy, J. Leenstra, S. Lightstone, S. Liu, G. M. Lohman, et al. DB2 with BLU acceleration: So much more than just a column store. Proceedings of the VLDB Endowment, 6(11):1080-1091, 2013.
[34] K. A. Ross. Selection conditions in main memory. ACM Transactions on Database Systems (TODS), 29(1):132-161, 2004.
[35] R. Rui and Y.-C. Tu. Fast equi-join algorithms on GPUs: Design and implementation. In Proceedings of the 29th International Conference on Scientific and Statistical Database Management, page 17. ACM, 2017.
[36] N. Satish, C. Kim, J. Chhugani, A. D. Nguyen, V. W. Lee, D. Kim, and P. Dubey. Fast sort on CPUs and GPUs: A case for bandwidth oblivious SIMD sort. In Proceedings of the 2010 ACM SIGMOD International Conference on Management of Data, pages 351-362. ACM, 2010.
[37] S. Schuh, X. Chen, and J. Dittrich. An experimental comparison of thirteen relational equi-joins in main memory. In Proceedings of the 2016 International Conference on Management of Data, pages 1961-1976. ACM, 2016.
[38] P. Sioulas, P. Chrysogelos, M. Karpathiotakis, R. Appuswamy, and A. Ailamaki. Hardware-conscious hash-joins on GPUs. Technical report, 2019.
[39] E. A. Sitaridi and K. A. Ross. Ameliorating memory contention of OLAP operators on GPU processors. In Proceedings of the Eighth International Workshop on Data Management on New Hardware, pages 39-47. ACM, 2012.
[40] E. A. Sitaridi and K. A. Ross. Optimizing select conditions on GPUs. In Proceedings of the Ninth International Workshop on Data Management on New Hardware, page 4. ACM, 2013.
[41] J. Sompolski, M. Zukowski, and P. Boncz. Vectorization vs. compilation in query execution. In Proceedings of the Seventh International Workshop on Data Management on New Hardware. ACM, 2011.
[42] E. Stehle and H.-A. Jacobsen. A memory bandwidth-efficient hybrid radix sort on GPUs. In SIGMOD. ACM, 2017.
[43] M. Stonebraker, D. J. Abadi, A. Batkin, X. Chen, M. Cherniack, M. Ferreira, E. Lau, A. Lin, S. Madden, E. O'Neil, et al. C-Store: A column-oriented DBMS. In Proceedings of the 31st International Conference on Very Large Data Bases, pages 553-564. VLDB Endowment, 2005.
[44] K. Wang, K. Zhang, Y. Yuan, S. Ma, R. Lee, X. Ding, and X. Zhang. Concurrent analytical query processing with GPUs. Proceedings of the VLDB Endowment, 7(11):1011-1022, 2014.
[45] J. Wassenberg and P. Sanders. Engineering a multi-core radix sort. In European Conference on Parallel Processing, pages 160-169. Springer, 2011.
[46] H. Wu, G. Diamos, S. Cadambi, and S. Yalamanchili. Kernel Weaver: Automatically fusing database primitives for efficient GPU computation. In MICRO. IEEE, 2012.
[47] M. Yabuta, A. Nguyen, S. Kato, M. Edahiro, and H. Kawashima. Relational joins on GPUs: A closer look. IEEE Transactions on Parallel and Distributed Systems, 28(9):2663-2673, 2017.
[48] Y. Yuan, R. Lee, and X. Zhang. The yin and yang of processing data warehousing queries on GPU devices.