Hierarchical Jacobi Iteration for Structured Matrices on GPUs using Shared Memory
Mohammad Shafaet Islam*,a and Qiqi Wang a
a Department of Aeronautics and Astronautics, Massachusetts Institute of Technology, 77 Massachusetts Avenue, Cambridge, MA 02139, USA
Abstract
High fidelity scientific simulations modeling physical phenomena typically require solving large linear systems of equations which result from discretization of a partial differential equation (PDE) by some numerical method. This step often takes a vast amount of computational time to complete, and therefore presents a bottleneck in simulation work. Solving these linear systems efficiently requires the use of massively parallel hardware with high computational throughput, as well as the development of algorithms which respect the memory hierarchy of these hardware architectures to achieve high memory bandwidth. In this paper, we present an algorithm to accelerate Jacobi iteration for solving structured problems on graphics processing units (GPUs) using a hierarchical approach in which multiple iterations are performed within on-chip shared memory every cycle. A domain decomposition style procedure is adopted in which the problem domain is partitioned into subdomains whose data is copied to the shared memory of each GPU block. Jacobi iterations are performed internally within each block's shared memory, avoiding the need to perform expensive global memory accesses every step. We test our algorithm on the linear systems arising from discretization of Poisson's equation in 1D and 2D, and observe speedup in convergence using our shared memory approach compared to a traditional Jacobi implementation which only uses global memory on the GPU. We observe an 8x speedup in convergence in the 1D problem and a nearly 6x speedup in the 2D case from the use of shared memory compared to a conventional GPU approach.
Graphics processing units (GPUs) have emerged as a popular hardware architecture for scientific computing. While originally introduced for real-time graphics rendering in the gaming industry, GPUs were found to offer the computational horsepower and high memory bandwidth necessary to accelerate large scale computations which have traditionally been performed using CPUs. GPU computing has permeated many fields such as finance [1], computational chemistry [2] and fluid dynamics [3].

Simulation work often requires approximating the solution to a partial differential equation (PDE) governing some physical phenomenon in an efficient manner. This usually involves applying some numerical method to reduce the PDE into a set of linear equations. Solving the resulting linear system is often the bottleneck of many simulations. Improving the speed of linear solvers is essential to improving the value of simulation work. For this reason, there has been a tremendous effort to develop efficient GPU implementations of linear solver algorithms. As some examples, conjugate gradient solvers for sparse unstructured matrices and a multigrid solver for regular grids using GPU textures were developed by Bolz et al. [4]. To improve the performance of preconditioned conjugate gradient and GMRES, Li and Saad developed an efficient sparse matrix vector product on the GPU which exhibited up to eight times speedup relative to a parallel CPU implementation on several matrices from the SuiteSparse matrix collection [5].

* Corresponding author
Email addresses: [email protected] (Mohammad Shafaet Islam), [email protected] (Qiqi Wang)

Jacobi iteration has been a popular iterative method of study on the GPU. While slow to converge compared to Krylov methods, Jacobi iteration is completely parallel, making it quite amenable to GPU implementation [6]. Furthermore, Jacobi iteration and other similar stationary iterative methods form the backbone of highly effective geometric and algebraic multigrid methods [7]. A number of studies have been conducted to assess the performance of Jacobi iteration on the GPU. Zhang et al. study the performance of a CPU and CUDA based GPU implementation of Jacobi iteration on a dense matrix. They observe a speedup of 19 times in double precision (59 times in floating point precision) for a matrix size of 7680 (when all GPU threads are operating) compared to the CPU performance [8]. Wang et al. perform studies of Jacobi iteration on sparse matrices on the GPU. Their implementation uses shared memory to store the solution at each step, and they observe a speedup factor of up to 100 times relative to their CPU implementation [9]. Amorim et al. apply a weighted version of Jacobi iteration to a structured matrix resulting from a finite element discretization of a set of PDEs from cardiac modeling. They compare the performance of an OpenGL implementation to several CUDA implementations with global coalesced memory accesses, shared memory and texture memory, and find that a CUDA implementation with texture memory yields the best performance [10]. Jacobi iteration on a 2D Laplace equation is studied by Cecilia et al. They measure the time to perform a set number of iterations in CUDA using several code optimizations which involve the use of shared memory, and observe a three to four times speedup for various domain sizes compared to their unoptimized CUDA implementation (and even higher speedups relative to their CPU implementation) [11]. Several multi-CPU, multi-GPU and hybrid CPU/GPU implementations of Jacobi iteration on the 2D Poisson equation in a structured domain are developed by Venkatasubramanian and Vuduc. Their best implementation is able to achieve 98% GPU bandwidth. To further improve performance, they develop asynchronous implementations which remove the need for complete synchronization each step, leading to a 1.2 to 2.5 times speedup over their previously optimal implementation [12]. Ahamed and Magoulès develop a Krylov-like version of Jacobi iteration and apply it to several three dimensional problems (Laplace equation, gravitational potential equation, heat equation) on the GPU, observing a 23 times speedup using this new version compared to the serial CPU-based algorithm [13].

The studies listed above detail the application of Jacobi iteration on the GPU to a variety of problems, such as 2D structured problems and full 3D unstructured problems. In all cases, the studies highlight the need for efficient memory access for optimal performance. Memory bandwidth, not computational complexity, has emerged as the limiting factor in scientific simulations. Therefore, it is essential that applications utilizing the GPU respect its memory hierarchy in an effort to achieve the best performance.

In this work, we develop an algorithm to perform the Jacobi iterative method efficiently using on-chip GPU shared memory, which is quicker to access than GPU global memory. We focus specifically on structured matrices.
Shared memory is accessible on the GPU as a cache-like memory which has much higher bandwidth compared to the global memory which is traditionally used on the GPU [14]. However, shared memory is typically very small (16 kB - 48 kB based on settings and device compute capability [15]) and only exists per GPU block. This inspires a domain decomposition style algorithm where the domain is separated into subdomains which are allocated to the shared memory of different GPU blocks. Within the blocks, we perform many iterations of Jacobi, avoiding communication between GPU blocks at every step. This results in a hierarchical algorithm in which, every cycle, multiple Jacobi iterations are performed independently within the shared memory of each GPU block (termed subiterations) and copied back to global memory (which holds the most up to date solution at the end of the cycle). Previous studies have generally focused on the performance of the standard Jacobi iteration using shared memory. However, this hierarchical approach results in a variation of Jacobi iteration that demonstrates better performance in parallel settings where a memory hierarchy exists (such as on the GPU). We report comparisons between a classical GPU implementation of Jacobi iteration as well as this shared memory approach.

The rest of our paper is structured as follows. Section 2 provides background for this work: a review of the Jacobi iterative method for solving linear systems of equations and an overview of the CUDA framework for GPU computing. Section 3 introduces various approaches for performing Jacobi iteration, including our new shared memory approach, and provides performance comparisons between this approach and a traditional GPU approach with global memory for the 1D Poisson equation. Section 4 presents a similar algorithm and comparison results in the context of 2D problems. Lastly, Section 5 provides concluding remarks on this work.

Jacobi iteration is an iterative method used to solve a linear system Ax = b where A \in \mathbb{R}^{n \times n} and x, b \in \mathbb{R}^n. The Jacobi iterative method is an example of a stationary iterative method which involves a matrix splitting of the form A = M + N [16]. The iterative update is then given by:

    x^{(n+1)} = M^{-1} ( b - N x^{(n)} )    (1)

For Jacobi iteration, M \equiv D, where D is a diagonal matrix with diagonal entries equal to those of the matrix A. Meanwhile, the matrix N \equiv A - D. Elemental Jacobi iteration can be written as

    x_i^{(n+1)} = \frac{1}{a_{ii}} \left( b_i - \sum_{j \neq i} a_{ij} x_j^{(n)} \right)    (2)

In order for Jacobi iteration to converge, the spectral radius of the iteration matrix must be less than one:

    \rho \left( D^{-1} (A - D) \right) < 1    (3)

A schematic of the CUDA model is shown in Figure 1.

Figure 1: CUDA Hierarchy and Memory Model (taken from [14]).

A GPU program developed using CUDA (the parallel computing platform developed by NVIDIA for general purpose GPU programming) typically begins with initialization on the CPU (termed host) and transfers to the GPU (termed device) where parallel calculations will be performed. The most fundamental unit of parallel execution is the thread. A large number of threads can run simultaneously. This gives GPUs high computing power and the ability to perform mathematical operations much more quickly than CPUs. Groups of threads are organized into blocks (up to 1024 threads per block are permitted on most modern machines). Every thread is assigned a thread ID that is unique to it in the thread block, given by threadIdx.x.
For 2D and 3D problems, a threadIdx.y and threadIdx.z are also assigned (these are set to 0 by default for the 1D case). Similarly, every block has its own unique block ID given by blockIdx.x (a similar blockIdx.y and blockIdx.z are assigned in 2D and 3D). Each block contains its own shared memory storage which is accessible to the threads comprising the block.

When launching a kernel (a calculation to be performed on the device), the user specifies the number of threads per block (denoted by blockDim.x; blockDim.y and blockDim.z are also defined for 2D and 3D problems) as well as the total number of blocks to be invoked. The blocks are then distributed between streaming multiprocessors (SMs), and the calculations are launched by warps (groups of 32 threads belonging to the same block). It is preferable to select a threads per block value which is a multiple of 32 so that all threads in a warp are utilized. The warps are scheduled in a round-robin fashion so that latency is hidden as calculations are being performed. They continue to operate until all threads have completed their calculations.

When a block is allocated to an SM, the block receives a portion of the total shared memory on the SM. The amount of shared memory in a block is usually limited (modern GPUs typically have 48 kB of shared memory per block) and accessible only to threads within a block. However, shared memory is more quickly accessible (approximately 10 times higher bandwidth [17] and 100 times lower latency [15]) than the GPU global memory. Therefore, efficient use of intra-block shared memory can reduce runtime significantly.
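To make the launch parameters concrete, the following minimal sketch (illustrative only; the kernel and the work it performs are placeholders rather than part of our solver) shows a kernel launched with a threads per block value that is a multiple of 32 and enough blocks to cover N elements, along with the global index computed by each thread:

    // Illustrative sketch: launch configuration and per-thread global index.
    __global__ void scaleKernel(double *x, int N)
    {
        int gid = threadIdx.x + blockDim.x * blockIdx.x;   // unique global thread index
        if (gid < N) x[gid] *= 2.0;                        // placeholder per-element work
    }

    void launchExample(double *d_x, int N)
    {
        int tpb       = 256;                    // threads per block, a multiple of the warp size 32
        int numBlocks = (N + tpb - 1) / tpb;    // ceiling of N / tpb so every element gets a thread
        scaleKernel<<<numBlocks, tpb>>>(d_x, N);
    }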
We introduce a Jacobi iterative solver which utilizes GPU shared memory for one dimensional problems. In general, linear systems arising from one dimensional partial differential equations (PDEs) can be solved efficiently using direct methods (e.g. the Thomas algorithm for tridiagonal systems), with iterative methods being desirable when dealing with linear systems arising from discretizations of higher dimensional PDEs. However, applying our shared memory Jacobi iterative solver to the 1D case can provide insights which translate to higher dimensions.

We consider a linear system which arises from a 1D PDE (e.g. a diffusion equation) on a grid as shown in Figure 2. The interior points correspond to the different degrees of freedom (DOFs) of a linear system. We assume Dirichlet BCs, represented by the square endpoints.

Figure 2: A sample one-dimensional domain. Square points indicate Dirichlet boundary conditions.

One-dimensional PDEs discretized with standard discretizations (finite difference, finite elements) usually result in a tridiagonal system. Consider a general tridiagonal matrix A with the values a_i, d_i, and c_i in the i-th subdiagonal, diagonal, and superdiagonal entries, respectively (as shown below):

    A = \begin{pmatrix} d_1 & c_1 & & & \\ a_2 & d_2 & c_2 & & \\ & \ddots & \ddots & \ddots & \\ & & a_{n-1} & d_{n-1} & c_{n-1} \\ & & & a_n & d_n \end{pmatrix}

Applying the Jacobi iterative method (Equation (2)) to advance the i-th DOF (denoted by x_i) in our tridiagonal system from iteration n to iteration n + 1 results in the following update equation (where b_i represents the right hand side value at the i-th grid point):

    x_i^{(n+1)} = \frac{1}{d_i} \left( b_i - a_i x_{i-1}^{(n)} - c_i x_{i+1}^{(n)} \right)    (4)

For a particular DOF, Equation (4) shows that the solution at the next iteration depends only on the values at adjacent DOFs at the current iteration. This is illustrated in Figure 3. If the neighbor values are available, we can perform the update on the DOF. This idea is used to develop our various approaches for solving the 1D problem on the CPU and GPU.

Figure 3: Stencil for Jacobi iteration performed on a tridiagonal matrix. The value of a degree of freedom at the next step depends only on adjacent values at the current step.
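The code listings that follow call a jacobiUpdate routine corresponding to the update in Equation (4). Its exact signature is not shown in the listings; a minimal sketch is given here, assuming constant stencil coefficients with Poisson-like default values (for a general tridiagonal matrix, the row entries a_i, d_i, c_i would instead be passed per DOF):

    // Minimal sketch of the per-DOF Jacobi update of Equation (4); the signature and the
    // default coefficient values are assumptions, not the paper's code.
    __host__ __device__ inline
    double jacobiUpdate(double left, double right, double rhs,
                        double a = -1.0, double d = 2.0, double c = -1.0)
    {
        return (rhs - a * left - c * right) / d;   // x_i^{(n+1)} = (b_i - a_i x_{i-1}^n - c_i x_{i+1}^n) / d_i
    }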
Jacobi iteration can be implemented on the CPU as follows. Each degree of freedom in the grid can be updated in a sequence, with the subsequent values at step n + 1 relying only on the current values at step n. This requires two storage containers, one to store the solution at step n and the other to store the solution at step n + 1. Once all DOFs are updated in a given step, the two arrays can be swapped and the process can be repeated. A sample C++ style code to illustrate the CPU approach is shown below. The jacobiUpdate function corresponds to the update equation in Equation (4).

Given x0, b:

    double *x1 = new double[N];
    for (int k = 0; k < numSteps; k++) {
        for (int i = 1; i < N - 1; i++) {
            double leftvalue  = x0[i - 1];
            double rightvalue = x0[i + 1];
            x1[i] = jacobiUpdate(leftvalue, rightvalue, b[i]);   // Eq. (4); call signature assumed
        }
        swap(x0, x1);   // x0 now holds the most recent solution
    }

The GPU approach for implementing Jacobi iteration with global memory is quite similar to the CPU approach. Rather than use a for loop, each GPU thread can independently update each DOF. We must ensure that we request enough threads to perform the updates on all DOFs. We can do this by setting any value for the number of threads per block (preferably a multiple of 32, as threads operate in warps containing 32 threads), and setting the number of blocks equal to the ceiling of the total number of DOFs divided by the number of threads per block specified earlier. A sample CUDA C++ style code to illustrate the GPU approach is shown below.
Given x0, b:

    double *x1 = new double[N];
    for (int k = 0; k < numSteps; k++) {
        int i = threadIdx.x + blockDim.x * blockIdx.x;   // global index of this thread's DOF
        if (i > 0 && i < N - 1) {
            double leftvalue  = x0[i - 1];
            double rightvalue = x0[i + 1];
            x1[i] = jacobiUpdate(leftvalue, rightvalue, b[i]);   // Eq. (4); call signature assumed
        }
        __syncthreads();
        swap(x0, x1);
    }

Shared memory on the GPU has a higher bandwidth and lower latency compared to global memory [15, 17], so adapting GPU applications to use shared memory has become a popular optimization technique for improving performance. Performing Jacobi updates within shared memory can improve performance because reading current values from and writing updated values to shared memory will be much faster than performing these accesses in global memory. However, using shared memory can be challenging because it is allocated on a per block basis. Threads in one GPU block cannot access the shared memory of another GPU block. This is a byproduct of on-die shared memory being very limited. Only a set number of blocks can be active at a given time and receive a shared memory allocation. In general, blocks must wait in a queue to operate, receive a shared memory allocation and perform updates. Although the limited availability of shared memory makes it challenging to use, it can provide speedup in many applications which are memory bound, such as iterative solvers.

One way to utilize shared memory in this application is to partition the solution data between the available shared memory and operate on these partitions independently. For the 1D problem, the DOFs in the domain shown in Figure 2 can be partitioned into several subdomains. The DOFs in each subdomain may be updated by a distinct GPU block. Figure 4 shows an example of a 12 point grid, divided into 4 point subdomains which will be operated on by 3 GPU blocks.
Figure 4: 1D domain partitioned between different GPU blocks. Each GPU block is responsible for updating its own set of points.

The first step in our approach is to copy the DOF values in each subdomain from global memory (where the initial solution resides) to the shared memory of the GPU block which will be responsible for its update. Updating the edge values in each subdomain requires information from the adjacent DOFs which will not be accessible to the block. We address this by augmenting the subdomains by an extra point to the left and right (beyond the points that will be updated). These augmented subdomains (shown in Figure 5 for our example) can be copied to the shared memory of each GPU block. The shared memory simply holds a temporary copy of the solution which is to be operated on, while the final solution is always stored in global memory.

After the copy to shared memory is complete, we can perform Jacobi iteration on the interior points of each subdomain using the stencil update given in Equation (4). To maximize parallelism, each DOF update should be handled by a distinct thread. For our example, the number of threads per block should be set to 4 so that 4 threads can be used to update the 4 interior points within each subdomain. In general, the threads per block specification is exactly equal to the number of interior points in a subdomain (i.e. the subdomain size is directly tied to the user specified threads per block choice). Given a threads per block specification of blockDim.x, the subdomain size will be blockDim.x + 2 with blockDim.x interior points to be updated.
Figure 5: Copy of data from global memory (containing the entire grid) to the shared memory of each GPU block. Each block receives data corresponding to points it is responsible for updating plus an additional left and right point.

To execute the previous global to shared memory transfer efficiently, each of the blockDim.x threads should copy one value from global to shared memory, and the remaining two values can be copied by threads 0 and 1. One can alter the subdomain size by changing the threads per block specification. For efficiency, the threads per block specification should be a multiple of 32 because threads are launched in groups of 32 threads (known as warps). This ensures all threads in a warp are active and avoids thread divergence. As a consequence, subdomains with a multiple of 32 interior DOFs are preferable.

Since data movement tends to be the bottleneck in iterative linear solvers (rather than computation), it is beneficial to perform many updates within shared memory before executing expensive data transfers back to global memory. Following this logic, we recommend performing many Jacobi iterations within shared memory. We refer to the Jacobi iterations performed within shared memory as subiterations, and denote the number of subiterations performed by k. Upon performing the k subiterations, the updated values can be copied from the shared memory of every GPU block to their global memory position. Only the interior blockDim.x values of each subdomain must be copied back from shared to global memory since the two edgemost points are not updated. Each thread should copy the same DOF it updated from shared to global memory. Shared memory is terminated at the end of this step and the updated solution now exists in global memory.

This completes one cycle of our shared memory algorithm. The algorithm is implemented in a single GPU kernel (since shared memory is only active as long as a GPU kernel is running). A sample CUDA C++ style code to illustrate the implementation is provided in Appendix A. The main parameters in the algorithm are the number of threads per block (blockDim.x) and the number of subiterations (k). The number of threads per block controls the size of the subdomains in shared memory, while the number of subiterations controls the number of Jacobi iterations performed within shared memory before communication back to global memory. Our algorithm is depicted in Figure 6. In summary, each cycle of the algorithm involves the following three steps:

1. Partition the gridpoints between the different GPU blocks and copy values from global memory to shared memory. Each subdomain has size blockDim.x + 2.
2. Update the interior blockDim.x points in each GPU block by performing k Jacobi iterations (we refer to these as subiterations).
3. Update the global memory solution array by copying the interior blockDim.x values from shared memory to their appropriate positions in global memory (shared memory is terminated at this point).

An application that uses GPU shared memory requires the user to specify the number of bytes of shared memory needed. At the very least, there must be enough space for twice the subdomain data (which has size blockDim.x + 2) because Jacobi iteration requires two storage containers (one containing the current solution values and one containing the updated values). During even subiterations, the first blockDim.x + 2 entries within shared memory should hold the current solution values while the updated values are placed in the second blockDim.x + 2 entries.
During odd subiterations, this should be reversed, with the second blockDim.x + 2 entries holding the current solution values while the updated values are placed in the first blockDim.x + 2 entries. The blockDim.x right hand side values corresponding to the interior subdomain DOFs can also be copied to shared memory (as the right hand side value appears in the Jacobi update function in Equation (4)).
Figure 6: Schematic of the shared memory algorithm for 1D Jacobi iteration. Initially, data is copied from global memory to the shared memory of different GPU blocks. Each GPU block is responsible for updating a distinct set of points. After performing k Jacobi iterative updates (marked by X), the updated point values are transferred back to global memory.

Therefore, the total amount of shared memory required for the algorithm is two times the subdomain size (blockDim.x + 2) for the solution arrays plus an array of size blockDim.x for the right hand side values corresponding to the interior DOFs. This should be multiplied by the size of the representation (4 bytes for floats, or 8 bytes for doubles).

One practical question to consider when applying the algorithm is what is a good number of subiterations to perform within shared memory. Using a large value of k is beneficial as we can perform many Jacobi updates without requiring expensive global memory accesses. The drawback of setting a large k is the lack of frequent updates of the boundary values of each subdomain. These boundary points are updated by the adjacent blocks, and the most up to date values are available to a subdomain only after each cycle, when all values are transferred back to global memory after the subiterations are performed and reallocated back to shared memory. Therefore, subiterations used to update interior points within a subdomain rely on old boundary data, which can negatively impact convergence. Setting k = 1 ensures that subdomains have the most up to date boundary values at every step, and is exactly equivalent to performing Jacobi iteration. However, setting such a low k value requires many expensive global memory accesses which will inhibit algorithm performance. There is a tradeoff between receiving up to date subdomain values more frequently to improve convergence (by setting a small k), and performing fewer global memory accesses to improve performance (by setting a large k). We explore good values for k in our numerical tests.
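As a minimal host-side sketch of one such cycle (the kernel name sharedJacobiCycle and its argument list are assumptions; the kernel body corresponds to Appendix A), the dynamic shared memory request follows the sizing described above:

    // Hypothetical kernel implementing one cycle of the 1D shared memory algorithm.
    __global__ void sharedJacobiCycle(double *globalMemory, double *rhs, int numSubiterations);

    void runCycles(double *d_x, double *d_rhs, int N, int numCycles, int k)
    {
        int    tpb        = 32;                                       // interior points per subdomain
        int    numBlocks  = (N + tpb - 1) / tpb;                      // cover all N interior DOFs
        size_t shmemBytes = (2 * (tpb + 2) + tpb) * sizeof(double);   // two solution buffers + RHS values
        for (int c = 0; c < numCycles; ++c)
            sharedJacobiCycle<<<numBlocks, tpb, shmemBytes>>>(d_x, d_rhs, k);
        // A convergence check on the residual would normally decide when to stop.
    }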
The classic GPU and shared memory approaches for Jacobi iteration presented in Sections 3.2-3.3 are applied to the tridiagonal system arising from a finite difference discretization of the Poisson equation on a one dimensional domain x \in [0, 1] with Dirichlet boundary conditions (shown in Equation (5)):

    -\frac{d^2 u}{dx^2} = f, \qquad u(0) = u(1) = 0    (5)

The tridiagonal system is given by

    A x = b    (6)

where
    A = \frac{1}{\Delta x^2} \begin{pmatrix} 2 & -1 & & \\ -1 & 2 & -1 & \\ & \ddots & \ddots & \ddots \\ & & -1 & 2 \end{pmatrix}, \qquad b = \begin{pmatrix} 1 \\ 1 \\ \vdots \\ 1 \end{pmatrix}

The right hand side vector b has all entries set to one. We consider a problem size of N = 1024 interior DOFs (1026 points in total) and measure the time required to reduce the L2 residual norm of the initial solution (set to a vector of ones) by a factor of 1e-4 using Jacobi iteration. The elemental Jacobi update (Equation (2)) applied to the 1D Poisson equation results in the update equation (7):

    x_i^{(n+1)} = \frac{b_i (\Delta x)^2 + x_{i-1}^{(n)} + x_{i+1}^{(n)}}{2}    (7)

Figure 7 compares the time required for the classic GPU and shared memory implementations for subiteration values of k = 4, 8, 16, 32, 64, 128; the shared memory approach performs best when k = 8. Past k = 16, the time to converge increases linearly with the number of subiterations, suggesting that performing too many subiterations within shared memory can hurt performance.

Table 1 shows the speedup achieved going from the best classic GPU performance to the shared memory performance for each subiteration value specified. The best speedup achieved is nearly four times over the classic GPU approach when k = 8 (setting k = 16 results in a nearly similar speedup). The results demonstrate that the time for convergence for Jacobi iteration can be reduced using shared memory.

Figure 7: Comparison of time required for classic GPU and shared memory implementations with different subiteration values to reduce the residual of the solution by 1e-4 for a problem size of N = 1024 (solving 1024 copies simultaneously). The shared memory approach outperforms the classic approach in all cases except when k = 128, and performs the best when k = 8.
Number of Subiterations     4      8     16     32     64     128
Best GPU to Shared Speedup  2.45   ...   ...    ...    ...    ...

Table 1: Speedup of the shared memory implementation relative to the best classic GPU performance for each subiteration value. The best speedup is achieved when k = 8.

3.5 Effect of Overlapping Subdomains in 1D

One key feature of our shared memory algorithm is that many Jacobi updates are performed each cycle without transfer to global memory. While this minimizes global memory accesses to improve performance, convergence is hampered by the lack of up to date values used for the updates. In particular, boundary data in each subdomain is only updated every cycle (after k subiterations), so many cycles may be required for convergence. DOFs close to the subdomain edges are especially affected and contribute a larger residual value to the overall residual norm. Improving the edge values can greatly improve the convergence of the algorithm and reduce the number of cycles necessary.

We attempt to alleviate this problem by using overlapping subdomains. The key idea of this approach is that DOFs which are further away from subdomain edges (closer to the subdomain center) have less error than those closer to the edges. With overlapping subdomains, we permit sets of points close to the subdomain edges to be allocated to two subdomains (copied to the shared memory of two GPU blocks). A DOF which exists in two subdomains will have a different local position in each subdomain. The subdomain in which the DOF is further from the edge is the one which will contribute the final value to global memory.

To illustrate this approach more clearly, an example is shown in Figure 8 for the 12 point grid. In this example, each subdomain has two interior DOFs in common with its neighbor. These overlapping DOFs are updated in the shared memory of two GPU blocks. Although either block could copy its DOF value back to global memory at the end of the cycle, we choose to have the left subdomain provide the final left point value and the right subdomain provide the final right point value in each set of overlapping points. This is because the local position of the left point within the left subdomain is further from the edge than its local position in the right subdomain. Meanwhile, the local position of the right point is further from the edge in the right subdomain than in the left subdomain. Figure 8 shows how the GPU blocks contribute to the overall global memory solution in this approach. In general, any even number of overlapping points is permitted (as long as it is less than the number of interior points in the subdomain), and in this case the left subdomain should contribute the left half of points while the right subdomain should contribute the right half.
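A rough sketch of the corresponding index bookkeeping is shown below (illustrative only; the function name, argument list and the exact boundary handling are assumptions rather than our implementation, and the indexing convention follows the listing in Appendix A):

    // Sketch of the write-back selection with overlapping subdomains; o is the (even) number
    // of interior points shared between neighbouring subdomains.
    __device__ void writeBackWithOverlap(double *globalMemory, const double *x0, int o)
    {
        int i = threadIdx.x + 1;                                  // local index in the padded subdomain
        int I = blockIdx.x * (blockDim.x - o) + threadIdx.x;      // global interior index of this DOF
        bool keepLeft  = (threadIdx.x >= o / 2)              || (blockIdx.x == 0);
        bool keepRight = (threadIdx.x <  blockDim.x - o / 2) || (blockIdx.x == gridDim.x - 1);
        if (keepLeft && keepRight)
            globalMemory[I + 1] = x0[i];                          // +1 accounts for the left boundary point
    }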
Figure 8: Schematic of the transfer of data between the global memory array and the shared memory of GPU blocks when subdomains have overlapping points (in this example, two interior points overlap between each subdomain). This approach improves the global memory solution of interior DOFs which reside near subdomain edges in shared memory.

Overlapping points between different subdomains and selectively copying DOF values as described will reduce the number of cycles required for convergence. This improvement in convergence is similar to behavior seen in overlapping domain decomposition methods, where the convergence rate is improved as more overlap is introduced between subdomains [18]. A drawback of overlapping subdomains is that multiple GPU blocks must perform updates on many of the points throughout the domain. This requires utilization of more GPU blocks, as the overlap effectively reduces the number of distinct points each subdomain handles. The number of overlapping points is another algorithm parameter which must be chosen judiciously.

Numerical tests are performed to study the effect of overlap on the performance of the shared memory algorithm. In these tests, a threads per block value of 32 is used as before. Each subdomain can have an even number of interior overlapping points with its neighboring subdomain (as long as the number of overlapping points is less than the number of interior points in a subdomain). For our threads per block specification of 32, we permit up to 30 interior overlapping points between subdomains. The left subdomain will contribute the left half of the overlapping points while the right subdomain contributes the right half. Figures 9a and 9b show the time and number of cycles required for convergence as a function of overlap for subiteration values of k = 4, 8, 16, 32, 64, 128. An initial amount of overlap decreases the number of cycles required for convergence. As the amount of overlap increases further, the number of cycles required for convergence continues to decrease slowly and eventually saturates. The time, however, grows dramatically as overlap is increased further. This is because the amount of computation required per cycle grows with overlap. Given a problem size with N interior DOFs, threads per block tpb and overlap o, the number of operational blocks required is:

    Operational blocks = \frac{N - tpb}{tpb - o} + 1    (8)

The logic for this is as follows. There are initially N + 2 points in the grid (including boundary points). The first block has a subdomain size of tpb + 2 points, and every subsequent block added handles tpb - o unique points in our grid. Therefore, subtracting the number of points handled by the initial block from N + 2 and dividing by the number of points handled by subsequent blocks gives the number of blocks required. This can also be written as:

    Operational blocks = \frac{N - tpb}{tpb - o} + 1 = \frac{N - tpb + tpb - o}{tpb - o} = \frac{N - o}{tpb - o}    (9)

The corresponding number of threads required is the number of blocks times the threads per block value tpb:

    Operational threads = (Operational blocks) \cdot tpb = \left( \frac{N - o}{tpb - o} \right) tpb    (10)

Equation (10) gives a sense of the amount of work required per cycle, which grows as overlap increases. Therefore, it is only beneficial to perform overlap as long as the number of cycles decreases. Once the number of cycles stagnates, it is no longer worth increasing the amount of overlap between subdomains.
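A small host-side helper mirroring Equations (8)-(10) is sketched below (illustrative only; it assumes N - o divides evenly by tpb - o):

    // Number of blocks and threads needed per cycle for a 1D problem with N interior DOFs,
    // tpb threads per block and an overlap of o interior points between neighbouring subdomains.
    int operationalBlocks(int N, int tpb, int o)
    {
        return (N - o) / (tpb - o);                   // Eq. (9)
    }

    int operationalThreads(int N, int tpb, int o)
    {
        return operationalBlocks(N, tpb, o) * tpb;    // Eq. (10)
    }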
For each value of k tested, Table 2 shows the optimal overlap for minimum time and the corresponding speedup of the shared memory algorithm relative to the classic GPU implementation. The results illustrate that the optimal number of overlapping points increases as the number of subiterations is increased. This is expected because increasing k causes more DOFs near the subdomain edges to have poor values due to lack of up to date boundary values (so more overlap is necessary to improve these DOFs). The best performance is achieved when setting k = 16 with 4 interior points overlapping between subdomains. This corresponds to an eight times speedup compared to the classic GPU performance. The best performance was previously achieved using k = 8, but now higher k values yield the best speedup because poor edge values (associated with high k) can be improved with overlap.

The numerical test results demonstrate that the convergence of Jacobi iteration for 1D problems can be accelerated using shared memory. Although the initial results suggested a four times speedup for this problem (resulting from tuning the number of subiterations), introducing overlap to improve edge point values results in an eightfold speedup. The optimal setting for the number of subiterations is about half the number of interior points in a subdomain. Furthermore, a small amount of overlap is enough to decrease the number of cycles dramatically. This shared memory approach can be extended to two dimensional problems for similar acceleration.

Figure 9: (a) Time and (b) number of cycles required to reduce the residual as a function of overlap for different subiterations k. For a given k, performing overlap initially decreases the number of cycles required for convergence (causing the time to decrease as well). The number of cycles eventually saturates as overlap increases further. However, the time increases due to the extra computational resources required. The best performance is achieved with k = 16.
Number of Subiterations  4    8    16    32    64    128
Optimal Overlap          2    4    ...   ...   ...   ...

Table 2: Optimal overlap for each subiteration value and the corresponding speedup relative to the classic GPU implementation. The best performance is achieved when k = 16 and 4 interior points overlap between each subdomain.

We extend the shared memory algorithm presented in Section 3.3 to the two-dimensional case. Consider a PDE discretized in a structured 2D domain. A model 2D PDE is one with second order derivatives, which, discretized using a 2nd order central difference scheme, results in a linear system where the matrix has pentadiagonal structure as shown below (a, d, c, e, f correspond to different constants):

    A = \begin{pmatrix} d & c & & f & & \\ a & d & c & & \ddots & \\ & \ddots & \ddots & \ddots & & f \\ e & & \ddots & \ddots & \ddots & \\ & \ddots & & a & d & c \\ & & e & & a & d \end{pmatrix}

(the e and f bands are offset from the diagonal by the number of grid points in a row of the domain). Applying the Jacobi iterative method to the ij-th degree of freedom in the grid (i and j correspond to indices in the spatial x, y domain) results in an update equation of the form in Equation (11). The updated value of the degree of freedom with indices i, j depends only on its neighbor values at the current iteration. This is depicted in Figure 10.

    x_{ij}^{(n+1)} = \frac{1}{d} \left( b_{ij} - a x_{i-1,j}^{(n)} - c x_{i+1,j}^{(n)} - e x_{i,j-1}^{(n)} - f x_{i,j+1}^{(n)} \right)    (11)

Figure 10: 2D stencil for Jacobi iteration performed on a pentadiagonal matrix. The value of a degree of freedom at the next step depends only on its neighbors' values at the current step.

4.1 GPU Approach with Shared Memory
The approaches to solving a 2D PDE using Jacobi iteration on the CPU and the GPU are analogous to the 1D case. For the shared memory approach, we must utilize a domain decomposition strategy. A structured 2D domain can be subdivided into smaller subdomains which can be allocated to the shared memory of many GPU blocks and updated independently. Figure 11a illustrates a 2D domain with 12 by 12 interior points partitioned into smaller 4 by 4 subdomains which would be updated by 9 GPU blocks (each color represents the points handled by a distinct GPU block). The square points correspond to Dirichlet boundary conditions.

To prevent threads from being idle, we ensure that every thread updates one interior point in the subdomain. To achieve this, the number of interior points in a subdomain is set to the number of threads per block in the x direction (denoted by blockDim.x) by the number of threads per block in the y direction (denoted by blockDim.y). Updating the points on the edges of each subdomain requires information from the edges of neighboring subdomains. To prevent the need for communication of neighbors at every step, we simply augment each subdomain by one point in all dimensions as shown in Figure 11b. Therefore, the subdomains to be copied to shared memory have size blockDim.x + 2 by blockDim.y + 2, where the inner blockDim.x by blockDim.y points will be updated using the Jacobi update equation.

Figure 11: Illustration of a 2D domain which is partitioned into several subdomains: (a) 2D domain and subdomains; (b) 2D augmented subdomains. The augmented subdomains (shown on the right with dashed lines identifying each subdomain) are copied from global memory to the shared memory of different GPU blocks. Each GPU block will perform updates for one subdomain.

The first step of the 2D shared memory algorithm is to copy the subdomain data from global memory to the shared memory of the GPU block which will perform the updates. The blockDim.x by blockDim.y threads in each block can be used to copy the blockDim.x + 2 by blockDim.y + 2 subdomains to shared memory (each thread must transfer at least one point). After the copy to shared memory, the interior points of the subdomain can be advanced using the stencil update in Equation (11). Rather than perform just one update, we can perform multiple subiterations similar to the 1D case. Each thread is responsible for updating one DOF.
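A rough sketch of this inner update loop is shown below (illustrative only; the tile, newTile and rhsTile names, the row-major tile layout and passing the stencil coefficients as arguments are assumptions rather than our kernel):

    // Subiteration loop inside shared memory for the 2D case, assuming a row-major
    // (blockDim.x + 2) x (blockDim.y + 2) tile with halo points and the coefficients of Eq. (11).
    __device__ void subiterate2D(double *tile, double *newTile, const double *rhsTile,
                                 double a, double c, double e, double f, double d, int k)
    {
        int tx = threadIdx.x + 1;                      // local x index of this thread's interior DOF
        int ty = threadIdx.y + 1;                      // local y index
        int w  = blockDim.x + 2;                       // padded row width of the tile
        for (int s = 0; s < k; ++s) {
            double xm = tile[ty * w + (tx - 1)];       // x_{i-1,j}
            double xp = tile[ty * w + (tx + 1)];       // x_{i+1,j}
            double ym = tile[(ty - 1) * w + tx];       // x_{i,j-1}
            double yp = tile[(ty + 1) * w + tx];       // x_{i,j+1}
            newTile[ty * w + tx] =
                (rhsTile[threadIdx.y * blockDim.x + threadIdx.x]
                 - a * xm - c * xp - e * ym - f * yp) / d;        // Equation (11)
            __syncthreads();                           // all writes finish before the buffers swap
            double *tmp = tile; tile = newTile; newTile = tmp;
        }
    }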
After the updates are complete, the interior points of the subdomains can be copied back from shared memory to their appropriate positions in global memory. At this point, the most up to date solution lies in global memory and the shared memory can be terminated.

This completes one cycle of the shared memory algorithm in 2D. The main parameters of this algorithm are the number of threads per block in the x-direction (blockDim.x) and y-direction (blockDim.y), which control the dimensions of the subdomains, as well as the number of subiterations performed within shared memory (denoted by k as before). The algorithm should be implemented as a single GPU kernel so that shared memory is not terminated until the end. In summary, each cycle involves the following three steps (similar to the 1D case):

1. Partition the gridpoints between the different GPU blocks and copy values from global memory to shared memory. Each subdomain has size blockDim.x + 2 by blockDim.y + 2.
2. Update the interior blockDim.x by blockDim.y points in each GPU block by performing k Jacobi iterations.
3. Update the global memory solution array by copying the interior blockDim.x by blockDim.y values from shared memory to their appropriate positions in global memory.

We can compute the amount of shared memory required for the 2D algorithm. Since Jacobi iteration requires two storage containers, we must allocate enough data for twice the subdomain size of (blockDim.x + 2) * (blockDim.y + 2). Furthermore, the right hand side values corresponding to the interior subdomain DOFs can also be copied to shared memory to improve performance (as the right hand side appears in the update). This requires another blockDim.x * blockDim.y array. Therefore, the total amount of shared memory required is 2 * (blockDim.x + 2) * (blockDim.y + 2) + (blockDim.x * blockDim.y). This must be multiplied by the size of the representation (4 bytes for floats, 8 bytes for doubles).

One issue that can arise in shared memory applications is the problem of bank conflicts, which occur when threads in a given warp access non-contiguous data in shared memory. In this case, some fraction of the shared memory accesses are serialized, which can significantly reduce memory bandwidth. To avoid performance loss, all warps should access contiguous data. In the 2D problem, each row of interior points in a subdomain lies in contiguous memory, but each subsequent row of interior points is separated from the previous row by two boundary points (due to the actual layout in memory as a 1D array). Therefore, we should ensure that each warp updates a single row. This is done by setting the threads per block value in the x direction to 32, so that each row has 32 interior points (contiguous in memory) which will be updated by the 32 threads in a given warp.
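A minimal host-side sketch of the corresponding launch (the kernel name sharedJacobiCycle2D and its arguments are assumptions) illustrates the shared memory request:

    // Hypothetical kernel implementing one cycle of the 2D shared memory algorithm.
    __global__ void sharedJacobiCycle2D(double *globalMemory, double *rhs, int numSubiterations);

    void runCycle2D(double *d_x, double *d_rhs, int Nx, int Ny, int k)
    {
        dim3 block(32, 32);                                      // 32 x 32 interior points per subdomain
        dim3 grid((Nx + block.x - 1) / block.x,
                  (Ny + block.y - 1) / block.y);
        size_t shmemBytes =
            (2 * (block.x + 2) * (block.y + 2) + block.x * block.y) * sizeof(double);
        // With 32 x 32 blocks and doubles this gives 26,688 bytes per block, matching the
        // 26.688 kB figure quoted in the text.
        sharedJacobiCycle2D<<<grid, block, shmemBytes>>>(d_x, d_rhs, k);
    }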
We study the performance of the shared memory algorithm in solving the pentadiagonal system arising from discretization of the 2D Poisson equation in the domain x \in [0, 1] \times y \in [0, 1]:

    -\left( \frac{\partial^2 u}{\partial x^2} + \frac{\partial^2 u}{\partial y^2} \right) = f(x, y), \qquad u(0, y) = u(1, y) = u(x, 0) = u(x, 1) = 0    (12)

The pentadiagonal matrix in our linear system Ax = b is given by

    A = \begin{pmatrix} d & a & & c & & \\ a & d & a & & \ddots & \\ & \ddots & \ddots & \ddots & & c \\ c & & \ddots & \ddots & \ddots & \\ & \ddots & & a & d & a \\ & & c & & a & d \end{pmatrix}

where a = -\frac{1}{(\Delta x)^2}, c = -\frac{1}{(\Delta y)^2}, d = 2 \left( \frac{1}{(\Delta x)^2} + \frac{1}{(\Delta y)^2} \right).

The elemental Jacobi iterative update for the 2D Poisson problem is:

    x_{i,j}^{(n+1)} = \frac{1}{\frac{2}{(\Delta x)^2} + \frac{2}{(\Delta y)^2}} \left[ b_{i,j} + \frac{1}{(\Delta x)^2} \left( x_{i-1,j}^{(n)} + x_{i+1,j}^{(n)} \right) + \frac{1}{(\Delta y)^2} \left( x_{i,j-1}^{(n)} + x_{i,j+1}^{(n)} \right) \right]    (13)

    = \frac{b_{i,j} (\Delta x)^2 + x_{i-1,j}^{(n)} + x_{i+1,j}^{(n)} + x_{i,j-1}^{(n)} + x_{i,j+1}^{(n)}}{4}, \qquad \Delta x = \Delta y    (14)

We consider a domain containing N_x = 1024 by N_y = 1024 interior grid points in the x and y directions and measure the time to reduce the L2 residual norm of the initial solution by a factor of 1e-4. This problem is analogous to the previous 1D problem in terms of computational complexity, and requires full utilization of the GPU. The entries of the initial solution and the right hand side vector are set to one. A TITAN V GPU is used for the numerical tests, which are performed using double precision.

The number of threads per block in the x direction is set to 32 to avoid shared memory bank conflicts. For the classic GPU approach, the threads per block value in the y direction is varied over several values starting from 4. For the shared memory approach, the threads per block value in the y direction is set to 32 so that the problem domain is split into subdomains with 32 by 32 interior points. The amount of shared memory required using these specifications is 26.688 kB per block (less than the maximum allowable value of 48 kB, although doubling either dimension of the subdomain would cause us to exceed this value). The number of subiterations is set to the following values (the same values as in the 1D problem): k = 4, 8, 16, 32, 64, 128. Figure 12 compares the time required for the classic GPU and shared memory approaches; the best performance is achieved when k = 16. The time grows linearly as the number of subiterations is increased past this point.

Figure 12: Comparison of time required for classic GPU and shared memory approaches to reduce the residual of the solution by 1e-4 for a 2D problem size of N_x = 1024 by N_y = 1024. The shared memory approach outperforms the best classic GPU time in all cases, and the best performance is achieved when k = 16.

Table 3 shows the speedup obtained when comparing the best classic GPU performance to the shared memory performance for each of the subiteration values explored. A nearly four times speedup is achieved using the shared memory implementation (similar to the initial 1D speedups) when k = 16, demonstrating that shared memory can improve the convergence of Jacobi iteration for structured 2D problems.
Number of Subiterations     4      8      16     32     64     128
Best GPU to Shared Speedup  2.17   3.67   ...    ...    ...    ...

Table 3: Speedup of the shared memory approach relative to the best classic GPU performance for each subiteration value in the 2D problem. The best speedup is achieved when k = 16.

The shared memory approach improves the performance of classic Jacobi iteration on the GPU by allowing many updates to be performed without requiring global memory accesses every step. However, this comes at the cost of lacking up to date boundary values every subiteration. In the 2D problem, the left, right, top and bottom boundary edges are only up to date at the beginning of a cycle, but not for all k subiterations. This causes DOFs close to these subdomain edges to rely on old boundary data and have poor solution values which contribute greatly to the residual norm (especially for larger values of k). Improving the interior values near subdomain edges (especially interior points at the corners) can greatly improve the convergence of our algorithm. This can be done by extending the overlapping approach in Section 3.5 to 2D problems. In the 2D case, sets of points can overlap between subdomains in both the x and y directions. Therefore, a DOF can be allocated to and updated within multiple subdomains. However, the subdomain in which the local position of the DOF is further from a subdomain edge in both the x and y directions should contribute the final value at the end of a cycle. This is a natural extension of the 1D overlapping approach to the 2D case.

We explore the performance of the 2D shared memory algorithm with overlap. As before, the threads per block values are set to 32 in both directions. In this case, a subdomain can have up to 30 interior points overlap with its neighboring subdomains in both directions. While it is permissible to have a different amount of overlap in the x and y directions, the symmetry of the domain and subdomain sizes suggests that the optimal overlap amount would be the same in both directions. Figures 13a and 13b show the time and number of cycles required for the shared memory approach to converge as the overlap in both directions is increased. The same subiteration values are used as before. Figure 13b shows that an initial amount of overlap can significantly drop the number of cycles required for convergence, especially when the number of subiterations is large. The corresponding time also decreases, except in the case when k = 4, as the decrease in the number of cycles is not enough to outweigh the extra work incurred due to overlap. As the number of overlapping points increases further, the number of cycles saturates. However, the time continues to increase because larger overlapping values require more GPU resources and computational time per cycle.

We conduct a similar analysis to the 1D case to determine the amount of work required per cycle as a function of the overlap. The 1D analysis suggested that the number of operational blocks was given by:

    Operational blocks = \frac{N - o}{tpb - o}    (15)

For the 2D case, assume a domain with interior dimensions N_x by N_y, with subdomains with interior dimensions of tpb_x by tpb_y.
If the overlap in the x and y directions are given by o_x and o_y, the total number of operational blocks required is then

    Operational blocks = \left( \frac{N_x - o_x}{tpb_x - o_x} \right) \left( \frac{N_y - o_y}{tpb_y - o_y} \right)    (16)

The number of operational threads is then the number of operational blocks times the threads per block in the x and y directions:

    Operational threads = \left( \frac{N_x - o_x}{tpb_x - o_x} \right) \left( \frac{N_y - o_y}{tpb_y - o_y} \right) tpb_x \, tpb_y    (17)

Assuming that the parameters in the x and y directions are equivalent (N_x = N_y = N, tpb_x = tpb_y = tpb, o_x = o_y = o, as is true in our problem), Equation (17) simplifies to

    Operational threads = \left( \frac{N - o}{tpb - o} \right)^2 tpb^2    (18)

Equation (18) shows the amount of work or computational resources required per cycle based on the domain size, threads per block and overlap. Increasing the overlap increases the amount of work required per cycle, so overlap is useful if it decreases the number of cycles needed for convergence enough to outweigh the extra work it requires. Once the number of cycles saturates with overlap, it is no longer useful to perform further overlap.

Figure 13: (a) Time and (b) number of cycles required for convergence as a function of overlap for different numbers of subiterations k for the 2D problem. The behavior is similar to the 1D case. The best performance is achieved with k = 16.

Table 4 shows the optimal overlap value and corresponding speedup of the shared memory algorithm relative to the classic GPU implementation for the 2D problem, for each of the subiteration values used. As before, the optimal overlap increases with the number of subiterations. A nearly six times speedup is achieved using k = 32 with 4 overlapping points in the x and y directions.
Table 4: Optimal overlap in the x and y directions and the corresponding speedup relative to the classic GPU implementation for each subiteration value (4, 8, 16, 32, 64, 128) in the 2D problem. The best performance is achieved when k = 32 with 4 overlapping points between subdomains in both directions.

The numerical test results demonstrate that the convergence of Jacobi iteration for 2D problems can also be accelerated using shared memory. Tuning the number of subiterations initially resulted in a four times speedup over the classical approach. However, introducing overlap to improve edge values augmented the speedup further, resulting in an almost six times speedup. The optimal number of subiterations was equal to the number of interior points in a subdomain along a single dimension. Furthermore, introducing a small amount of overlap in both directions was enough to decrease the number of cycles dramatically.

In this work, we have developed a hierarchical Jacobi solver for 1D and 2D structured problems that utilizes shared memory to improve performance. The approach involves partitioning the domain into multiple subdomains which are handled by GPU blocks. Relative to a classical GPU implementation of the Jacobi iterative method, the shared memory approach resulted in an eight times speedup in convergence for the 1D case and a nearly six times speedup for the 2D case. This speedup was achieved by exploring optimal parameters for the number of subiterations performed in shared memory and investigating the effect of overlapping subdomains (which accelerated convergence by improving DOF values near subdomain edges). For both the 1D and 2D test problems, setting the number of subiterations on the order of the subdomain dimension along with a small amount of overlap resulted in the optimal performance of the shared memory algorithm. This hierarchical Jacobi solver is very suitable for parallel architectures, and has the potential to accelerate multigrid solvers which utilize this approach as a smoother on high performance clusters. Overall, this work demonstrates the need to adapt classical algorithms to emerging hardware in a way that achieves both high computational throughput as well as efficient memory access.
Appendix A. Sample Pseudocode for 1D Shared Memory Algorithm
We provide a sample CUDA C++ style code to illustrate the components of the shared memory algorithm in 1D from an implementation standpoint. The code resembles the implementation of our GPU kernel which corresponds to one cycle of the shared memory algorithm. In the first step, the subdomain DOF data is copied from global memory to two containers in shared memory (corresponding to containers for the current and updated values in Jacobi iteration). The right hand side values for the DOFs corresponding to our subdomain are also copied to a third section of shared memory. In the second step, Jacobi subiterations are performed in shared memory. We have two storage containers for Jacobi iteration, with x0 referring to the current solution and x1 referring to the updated solution in every step. In step 3, the updated values of all interior points in shared memory are copied back to their respective global memory positions.

    /* Define shared memory */
    extern __shared__ double sharedMemory[];

    /* Define several constants */
    int subdomainSize = blockDim.x + 2;
    int i = threadIdx.x;
    int I = threadIdx.x + blockDim.x * blockIdx.x;

    /* STEP 1 - Copy data from global memory to shared memory */
    sharedMemory[i]                     = globalMemory[I];   // 1st container
    sharedMemory[i + subdomainSize]     = globalMemory[I];   // 2nd container
    sharedMemory[i + 2 * subdomainSize] = rhs[I + 1];        // right hand side values
    if (threadIdx.x == 0 || threadIdx.x == 1) {
        // threads 0 and 1 copy the two rightmost points of the augmented subdomain
        sharedMemory[i + blockDim.x]                 = globalMemory[I + blockDim.x];
        sharedMemory[i + blockDim.x + subdomainSize] = globalMemory[I + blockDim.x];
    }
    __syncthreads();    // the whole subdomain must be in shared memory before updating

    /* STEP 2 - Perform subiterations within shared memory */
    i = threadIdx.x + 1;                                    // local index of this thread's interior DOF
    double *x0        = sharedMemory;                       // 1st container (current values)
    double *x1        = sharedMemory + subdomainSize;       // 2nd container (updated values)
    double *sharedrhs = sharedMemory + 2 * subdomainSize;
    for (int k = 0; k < numSubiterations; k++) {
        double leftvalue  = x0[i - 1];
        double rightvalue = x0[i + 1];
        x1[i] = jacobiUpdate(leftvalue, rightvalue, sharedrhs[i - 1]);   // Eq. (4); call signature assumed
        __syncthreads();                                    // all threads finish before the swap
        double *tmp = x0; x0 = x1; x1 = tmp;                // swap current and updated containers
    }

    /* STEP 3 - Transfer from shared memory to global memory */
    globalMemory[I + 1] = x0[i];    // x0 holds the most recent values after the final swap

References

[1] A. Gaikwad and I. M. Toke. "Parallel Iterative Linear Solvers on GPU: A Financial Engineering Case". 2010, pp. 607–614 (cit. on p. 1).
[2] You-Liang Zhu et al. "GALAMOST: GPU-accelerated large-scale molecular simulation toolkit". In: Journal of Computational Chemistry (cit. on p. 1).
[3] GPU Gems. 1st. Boston, MA: Addison-Wesley, 2004 (cit. on p. 1).
[4] Jeff Bolz et al. "Sparse Matrix Solvers on the GPU: Conjugate Gradients and Multigrid". In: ACM Transactions on Graphics (cit. on p. 1).
[5] Li and Saad. In: The Journal of Supercomputing (2012) (cit. on p. 1).
[6] Gene H. Golub and Charles F. Van Loan. Matrix Computations. 4th. Baltimore: The Johns Hopkins University Press, 2013. ISBN: 1421407949 (cit. on p. 2).
[7] William L. Briggs, Van Emden Henson, and Steve F. McCormick. A Multigrid Tutorial. SIAM, 2000 (cit. on p. 2).
[8] Zhihui Zhang, Qinghai Miao, and Ying Wang. "CUDA-Based Jacobi's Iterative Method". Vol. 1. 2009, pp. 259–262 (cit. on p. 2).
[9] Tao Wang et al. "Implementation of Jacobi iterative method on graphics processor unit". Vol. 3. 2009, pp. 324–327 (cit. on p. 2).
[10] Ronan Amorim et al. "Comparing CUDA and OpenGL implementations for a Jacobi iteration". 2009, pp. 22–32 (cit. on p. 2).
[11] José María Cecilia, José Manuel García, and Manuel Ujaldón. "CUDA 2D Stencil Computations for the Jacobi Method". In: Applied Parallel and Scientific Computing. Ed. by Kristján Jónasson. Berlin, Heidelberg: Springer Berlin Heidelberg, 2012, pp. 173–183 (cit. on p. 2).
[12] Sundaresan Venkatasubramanian and Richard W. Vuduc. "Tuned and wildly asynchronous stencil kernels for hybrid CPU/GPU systems". Jan. 2009, pp. 244–255 (cit. on p. 2).
[13] Abal-Kassim Cheik Ahamed and Frédéric Magoulès. "Efficient implementation of Jacobi iterative method for large sparse linear systems on graphic processing units". In: The Journal of Supercomputing 73 (2016), pp. 3411–3432 (cit. on p. 2).
[14] David B. Kirk and Wen-mei W. Hwu. Programming Massively Parallel Processors, Third Edition: A Hands-on Approach. 3rd. San Francisco, CA, USA: Morgan Kaufmann Publishers Inc., 2016. ISBN: 0128119861 (cit. on pp. 2, 3).
[15] Mark Harris. Using Shared Memory in CUDA C/C++. Jan. 2013. URL: https://devblogs.nvidia.com/using-shared-memory-cuda-cc/ (cit. on pp. 2, 4, 6).
[16] Jacques Mohcine Bahi, Sylvain Contassot-Vivier, and Raphael Couturier. Parallel Iterative Algorithms: From Sequential to Grid Computing. 1st. New York: Chapman and Hall, 2007. ISBN: 9780429142024 (cit. on p. 3).
[17] Nicholas Wilt. The CUDA Handbook: A Comprehensive Guide to GPU Programming. 1st. Addison-Wesley Professional, 2013 (cit. on pp. 4, 6).
[18] Victorita Dolean, Pierre Jolivet, and Frédéric Nataf.