Hardware Acceleration of HPC Computational Flow Dynamics using HBM-enabled FPGAs
Tom Hogervorst, Tong Dong Qiu, Giacomo Marchiori, Alf Birger, Markus Blatt, Razvan Nane
Tom Hogervorst‡, Tong Dong Qiu∗, Giacomo Marchiori∗, Alf Birger†, Markus Blatt§, Razvan Nane∗‡
∗Big Data Accelerate B.V., Delft, The Netherlands   †Equinor S.A., Bergen, Norway   §OPM-OP AS, Oslo, Norway   ‡Delft University of Technology, Delft, The Netherlands
[email protected]
Abstract—Scientific computing is at the core of many High-Performance Computing (HPC) applications, including computational flow dynamics. Because of the utmost importance of simulating increasingly larger computational models, hardware acceleration is receiving increased attention due to its potential to maximize the performance of scientific computing. A Field-Programmable Gate Array (FPGA) is a reconfigurable hardware accelerator that is fully customizable in terms of the computational resources and memory storage requirements of an application. It is therefore an ideal candidate to accelerate scientific computing applications, because it allows full customization of the memory hierarchy, which is important in irregular applications such as the iterative linear solvers found in scientific libraries. In this paper, we study the potential of using FPGAs in HPC, motivated by the rapid advances in reconfigurable hardware, such as the increase in on-chip memory size, the increasing number of logic cells, and the integration of High-Bandwidth Memory (HBM) on board. To perform this study, we first propose a novel ILU0 preconditioner tightly integrated with a BiCGStab solver kernel, designed using a mixture of High-Level Synthesis (HLS) and hand-coded Register-Transfer Level (RTL) design. Second, we integrate the developed preconditioned iterative solver in Flow from the Open Porous Media (OPM) project, a state-of-the-art open-source reservoir simulator. Finally, we perform a thorough evaluation of the FPGA solver kernel, both in standalone mode and integrated into the reservoir simulator, including all the on-chip URAM and BRAM, on-board High-Bandwidth Memory (HBM), and off-chip CPU memory data transfers required in a complex simulator software such as OPM's Flow. We evaluate the performance on the Norne field, a real-world reservoir model with 3 unknowns per grid cell, and find that the FPGA is on par with the CPU execution and 3 times slower than the GPU implementation when comparing only the kernel executions.

I. INTRODUCTION
The advent of data proliferation and the slowing down of Moore's law are driving many application designers to implement an algorithm's computationally demanding functionality using a hardware accelerator. While Graphical Processing Units (GPUs) benefit from a larger user base because of their massive number of parallel cores, on-board HBM, and ease of programming, FPGAs have recently been gaining a lot of attention because of their application-specific customizability. Furthermore, the recent integration of HBM in FPGA-based systems, coupled with the maturing of HLS programming tools [1], has touted the FPGA as a viable competitor to GPUs in emerging machine-learning-related domains, but also in HPC applications where GPUs were traditionally the sole candidate. The acquisition of Altera by Intel in 2016 [2] and the recent announcement of AMD acquiring Xilinx [3], two of the biggest FPGA providers, can be considered an indication of this trend.

In this context, it is highly relevant to reassess the potential of FPGAs for HPC, when we consider the addition of HBM on board, the increase of on-chip memory (e.g., UltraRAM (URAM) [4]), and the continuing increase of logic cell resources, which enables the implementation of complex functionality. To perform this evaluation, we look at a ubiquitous scientific computing problem found in many HPC application domains: the solution of systems of non-linear partial differential equations in time and space. These are typically not possible to solve exactly through analytical methods. Hence, the equations are discretized in time and over a domain space using finite difference, finite volume, or finite element methods. Finally, the derived system of equations is solved, often iteratively, using a combination of preconditioners and iterative linear solvers available in one of the many scientific solver libraries, such as DUNE [5], PETSc [6], or others. Due to the complexity of implementing linear solvers on FPGAs, only a few previous works have studied their acceleration on reconfigurable hardware. Moreover, many assumptions were made to simplify the design and implementation, such as focusing solely on single-precision floating-point support or using simple solvers without preconditioners, and all ignored the integration into the whole scientific computing pipeline. However, to realistically assess the potential of FPGAs, it is important to eliminate the aforementioned restrictions. Consequently, in this work we develop a complete solver with a preconditioner on FPGA with support for double precision, which is utterly important in a real-world setting because it enables the convergence to a solution. Furthermore, we integrate our work in a state-of-the-art reservoir simulator,
OPM Flow [7], and validate our work by running with a real-world use case containing a large number of grid cells ( ).To accomplish our study goals, we develop a novelBiCGStab [8] solver implementation with ILU0 [9] as pre- a r X i v : . [ c s . A R ] J a n onditioner on FPGA. This combination of preconditioner andlinear solver was chosen to get an apples-to-apples comparisonwith the CPU performance on the Norne field data set, as thiscombination was best performing in that case. To maximize theperformance, we use a mixed programming model in whichwe combine HLS with manually written RTL. Please notethat there might be other solvers more efficient for a parallelarchitecture such as the FPGA (and GPU) than an ILU0preconditioner. However, this is left for future work becauseit was most important to make certain the results are closeto the ones obtained by the already validated software onlysimulation. Therefore, we selected the same combination ofpreconditioner and solver as those used by default in OPMFlow and designed the solver to support double precisionbecause high accuracy is required, otherwise the selected realworld case would not converge. Furthermore, integration iskey to benchmark the performance in the overall system.Previous work has only benchmarked the solvers in isolation,whereas we consider the whole system for benchmarking,where including the memory transfers between off-chip andoff-board is crucial. Finally, we make our work available inthe OPM git repository [10].The contributions of this paper are summarized as follows: • We propose a novel Compressed Sparse Row (CSR)-based encoding to optimize the Sparse Matrix VectorMultiplication (SpMV) on FPGA, which is the corefunctionality required to implement iterative solvers. • We develop a hardware model to predict and guide thedesign of the solver, helping us to understand trade-off between area and performance when scaling theresources. • We design a novel ILU0-BiCGStab preconditioned solveron HBM-enabled FPGA using a mixed programmingmodel of HLS and RTL. • We integrate the proposed solver in
OPM Flow , bothfor FPGA and GPU. Furthermore, we make all sourcesavailable in the OPM git repository. • We provide extensive evaluation results for both stan-dalone solver kernel benchmarking as well as completereservoir simulation execution on a real world use caserunning on three different types of systems, CPU, FPGA,and GPU.In the next section II, we introduce the background requiredto understand the paper, while in section III we highlightrelated works. We then start a bottom-up explanation approachin which we first describe the SpMV kernel in section II-B,then the solver in section V, and discuss the performancehardware model used to guide the design in section VI. Finally,in section VII, we present the experimental results.II. B
ACKGROUND
A. Sparse Matrix Iterative Solvers
The central problem in this paper is to solve $A\vec{x} = \vec{b}$ for the vector $\vec{x}$, where $A$ is a known sparse matrix and $\vec{b}$ is a known dense vector. There are different methods for solving such a problem, which can be grouped into direct solvers and iterative solvers. Direct solvers, as the name implies, use linear algebra techniques to solve the problem for $\vec{x}$ directly and precisely. However, direct methods are not suitable for solving large sparse linear systems, as is usually the case in a real-world HPC application, because they fill in the zero values of the sparse matrix in order to solve the problem, resulting in a very high computational cost and memory usage. Iterative solvers are a heuristic alternative to direct solvers: they start from an initial guess of the solution, and then iteratively improve this guess until it is deemed close enough to the actual solution.

In this paper, we use the BiCGStab (bi-conjugate gradient stabilized) solver with ILU0 (incomplete LU factorization with no fill-in) as preconditioner, because this combination is the most robust and the default software configuration used in OPM Flow. Consequently, besides the goal of studying the acceleration potential of FPGAs, another implicit goal required by the reservoir engineers using the simulator was to obtain simulation results close to the software counterpart. Therefore, although it might not be the best choice for acceleration on a parallel architecture, fulfilling the conformance criteria led us to use the same solver used in the software stack.

Algorithm 1 shows the pseudo-code of the BiCGStab solver. This code makes use of several other functions:
• An spmv function, which performs a Sparse Matrix-Vector (SpMV) multiplication.
• A dot function, which calculates the dot product (or inner product) of two vectors.
• An axpy function, which scales its first input vector by a scalar value and adds it to its second input vector (i.e., axpy$(\alpha, \vec{x}, \vec{y})$ computes $\alpha\vec{x} + \vec{y}$).
• A preconditioner function. This is an optional part of the solver and does not have a set behavior. For more information about preconditioners, see section II-C.

The result of the solver consists of two vectors: $\vec{x}$, which is the estimate of the solution of $A\vec{x}_{actual} = \vec{b}$, and $\vec{r} = \vec{b} - A\vec{x}$, a residual vector, which indicates how close the result $\vec{x}$ is to the exact solution $\vec{x}_{actual}$. It is worth noting in Algorithm 1 how the exit condition of the solver is set, or in other words, how the algorithm determines when to stop iterating because the result is sufficiently close to the actual result. The solver stops iterating when the norm of the residual is lower than a certain convergence threshold. There are two ways in which this convergence threshold can be chosen: either it is a constant value, in which case we speak of an absolute exit condition, or it is the norm of the initial residual divided by a desired improvement factor, in which case we speak of a relative exit condition. An absolute exit condition can be useful when one needs to be certain that the result of a solve is at least within a certain numerical distance from the actual result. However, when solving for multiple different matrices from different domains, a matrix with lower values all across the board might start off at a lower error and also reach the absolute exit condition more quickly.
In such a case, a relative exit condition leads to a fairer comparison between solves, which is why a relative exit condition is used in this paper. Ignoring for now the preconditioner, what remains in the BiCGStab solver is a series of vector operations and SpMVs, as well as a few scalar operations. From profiling the solver, it is clear that the majority of its computational time is spent on performing SpMV and SpMV-like computations. Therefore, the first goal in designing a performant solver kernel is to perform the SpMV efficiently.
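As a concrete illustration of the two exit conditions, the following minimal C++ sketch computes the stop checks described above; the variable names are ours and not taken from the OPM code base:

#include <cmath>
#include <vector>

// Euclidean norm of a residual vector.
double norm(const std::vector<double>& r) {
    double sum = 0.0;
    for (double v : r) sum += v * v;
    return std::sqrt(sum);
}

// Relative exit condition: stop once the initial residual norm has been
// reduced by the desired improvement factor.
bool convergedRelative(const std::vector<double>& r0, const std::vector<double>& r,
                       double desiredImprovement) {
    return norm(r) <= norm(r0) / desiredImprovement;
}

// Absolute exit condition: stop once the residual norm drops below a constant.
bool convergedAbsolute(const std::vector<double>& r, double absThreshold) {
    return norm(r) <= absThreshold;
}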
Algorithm 1 Pseudo-code of the preconditioned BiCGStab solver

Input: b, x, A        ▷ b and x are N-element arrays, A is an N×N matrix
Output: r, x          ▷ both are N-element arrays

r = b − spmv(A, x)
rt = r
ρ = α = ω = 1
p = v = 0
norm = sqrt(dot(r, r))
conv_threshold = norm / desired_improvement
while norm > conv_threshold do
    ρ_new = dot(rt, r)
    β = (ρ ∗ α) / (ρ_new ∗ ω)
    p = axpy(β, r, axpy(ω, p, v))
    y = preconditioner(p)
    v = spmv(A, p)
    α = ρ_new / dot(rt, v)
    ρ = ρ_new
    x = axpy(α, x, y)
    r = axpy(−α, r, v)
    norm = sqrt(dot(r, r))
    if norm ≤ conv_threshold then
        break
    end if
    y = preconditioner(r)
    t = spmv(A, p)
    ω = dot(r, t) / dot(t, t)
    x = axpy(ω, x, y)
    r = axpy(−ω, r, t)
    norm = sqrt(dot(r, r))
end while

B. Sparse Matrix-Vector Multiplication
The SpMV is a widely used, rather elementary operation in which a sparse matrix is multiplied by a dense vector. How this operation is performed algorithmically depends on how the matrix is stored. A sparse matrix consists of many more zeroes than non-zero values, which means that storing all of them is not efficient. There is a variety of formats in which only the non-zero values of a sparse matrix are stored. Among these formats, Compressed Sparse Column (CSC) and Compressed Sparse Row (CSR) are two of the most commonly used, with both storing matrices efficiently without putting any restrictions on the content of the matrix, whereas other formats may require the matrix to have certain sparsity pattern features, such as symmetry or values being repeated multiple times, or may store additional non-zeroes, e.g., sliced ELLPACK [11], in order to be efficient. CSR and CSC do not have such restrictions. In this paper, we focus only on the CSR representation, as the column-major CSC format would be more difficult to use during the application of the ILU0 preconditioner, because ILU0 goes through the matrix row by row (see section II-C).

The CSR format consists of three arrays: an array with all non-zero values of the matrix stored row-major, an array with the column indices of those non-zero values, and an array of row pointers, which contains an index for each row identifying which non-zero value is the first value of that row, plus a final entry containing the total number of non-zeroes stored. The pseudo-code showing how the SpMV is performed on a matrix stored in the CSR format is given in Algorithm 2 below.

It is worth noting that there is a high degree of parallelism available in this SpMV. The simple explanation is that the calculations for all rows can be done in parallel, because the result of one row is never used during the calculation of another row. Nevertheless, this is true only in theory, because the irregular memory accesses into the x vector, which depend on the column index of the non-zero value in the system matrix, restrict in practice the amount of parallelism that can be exploited efficiently. The main bottleneck is the requirement to replicate the x vector to enable multiple read units to feed independent row processing units. This replication is impractical for realistic large use cases for two main reasons: 1) limited on-chip (i.e., cache) memory and 2) significant preprocessing computations required to align the data feed into the multiple processing units, which is needed to exploit the maximum available bandwidth in state-of-the-art accelerators with HBM.
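For reference, the same row-by-row computation shown in Algorithm 2 can be written as a minimal C++ routine (a plain software sketch with our own variable names, not the FPGA implementation):

#include <vector>

// y = A * x for a sparse matrix A stored in the CSR format:
// nnzs        - non-zero values, stored row-major
// colInds     - column index of each non-zero value
// rowPointers - start of each row in nnzs/colInds, plus one final entry equal to the non-zero count
std::vector<double> spmvCsr(const std::vector<double>& nnzs,
                            const std::vector<int>& colInds,
                            const std::vector<int>& rowPointers,
                            const std::vector<double>& x) {
    const int N = static_cast<int>(rowPointers.size()) - 1;
    std::vector<double> y(N, 0.0);
    for (int i = 0; i < N; ++i) {                       // every row is independent
        for (int v = rowPointers[i]; v < rowPointers[i + 1]; ++v) {
            y[i] += nnzs[v] * x[colInds[v]];            // irregular access into x
        }
    }
    return y;
}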
Algorithm 2 Pseudo-code of a CSR SpMV

Input: nnzs, colInds, rowPointers, x
  ▷ nnzs is an M-element array of floating point values
  ▷ colInds is an M-element array of integers
  ▷ rowPointers is an (N+1)-element array of integers
  ▷ x is an N-element array of floating point values
Output: y
  ▷ y is an N-element array of floating point values

for 0 ≤ i < N do
    y[i] = 0
    for rowPointers[i] ≤ v < rowPointers[i+1] do
        y[i] = y[i] + nnzs[v] ∗ x[colInds[v]]
    end for
end for

C. The ILU0 Preconditioner
How hard a system of linear equations is to solve depends on various aspects of those equations, or of the matrix that represents them: how many non-zeroes the matrix has, where they are located in the matrix, and their values in relation to one another, among other things, all play a role. The condition number $\kappa(A)$ of a matrix $A$, which is the ratio of the biggest eigenvalue of the matrix to the smallest one, is a measure of how relative changes in the right-hand side $\vec{b}$ of a linear equation $A\vec{x} = \vec{b}$ result in relative changes of the solution $\vec{x}$. In addition, it is a measure of how changes in $A$ affect the solution and how numerical errors affect the solution. Often, convergence of iterative methods is worse for matrices with high condition numbers. Therefore, while iterative solvers such as BiCGStab are a powerful method for solving sparse matrix problems, they might still take many iterations before they approach the result within the desired precision when solving matrices with a high condition number.

One technique for reducing the number of iterations needed by BiCGStab and other Krylov methods is utilizing a preconditioner. Generally, a preconditioner is a transformation or matrix $M$ that approximates $A$, but for which $M\vec{x} = \vec{b}$ is less computationally expensive to solve than $A\vec{x} = \vec{b}$. If $M$ is a matrix, then the preconditioned method solves $M^{-1}A\vec{x} = M^{-1}\vec{b}$. Note that if $M$ is a good approximation of $A$, then this system will have a much better condition number than the original one, and hence the iterative method will converge faster. A simple example of a preconditioner is the Jacobi method with $M = \mathrm{diag}(A)$, the diagonal values of $A$. Many different types of preconditioners exist, from very simple ones that only very weakly approximate the problem's solution, to complex ones that almost solve the system by themselves. Unfortunately, how well a preconditioner performs depends not only on the type of preconditioner, but also on the matrix being solved. For example, the Jacobi preconditioner is fairly accurate if the original matrix has most of its non-zeroes clustered around the diagonal and if the values on the diagonal have a higher magnitude than those off the diagonal, but it does not work as well for matrices that do not have those characteristics.

In this paper, we utilize the zero fill-in incomplete LU factorization (ILU0) as preconditioner, which falls somewhere in the middle: it approximates the system reasonably while not being too computationally intensive to apply. Using ILU0 consists of two steps: first, before the solver is started, an incomplete LU decomposition is performed. To understand what this is, we must first introduce the LU decomposition. LU decomposition is a direct solving method that calculates an upper-triangular matrix $U$ and a lower-triangular matrix $L$ so that $A = LU$. Such a decomposition can be done using Gaussian elimination, among other methods. Once such a decomposition has been found, $A\vec{x} = \vec{b}$ can be solved for $\vec{x}$ by first solving $L\vec{y} = \vec{b}$ for $\vec{y}$ and then solving $U\vec{x} = \vec{y}$ for $\vec{x}$.

Since $L$ and $U$ are triangular matrices, these two solves can be done using forward and backward substitution, respectively. However, as discussed in the previous section, direct solving methods become very computationally intensive for sparse matrices because they start filling in the zero values of the matrix, and LU decomposition is no exception. Incomplete LU(N) decomposition decreases the high computational cost of LU decomposition for sparse matrices by reducing the amount of fill-in that happens, where N is an indicator of how much fill-in is done. The case where N = 0 (ILU0) means that the decomposition generates non-zero values in the L and U matrices only at the indices where A had non-zero values.

Once the ILU0 decomposition is complete, the solver can start. The solver then applies the ILU0 preconditioner in every iteration. Pseudo-code of the ILU0 application is shown in Algorithm 3. The ILU0 application consists of two parts, represented by the two for loops: a forward substitution on the lower-triangular part of the LU matrix, and a backward substitution on the upper-triangular part of that matrix. Both loops are very similar in the operations they perform, with the main differences being that the forward substitution traverses the matrix top-to-bottom while the backward substitution traverses it bottom-to-top, and that the backward substitution concludes by dividing the result array by the values on the diagonal of the LU matrix. It is worth noting that, while these loops ostensibly have the same basic structure as the SpMV loop, there is one major difference: the results of one row depend on rows before it, which means the operations on the rows cannot all be performed in parallel. See section II-D for more on how this issue can be circumvented somewhat by reordering the matrix.

To quantify the effect of the ILU0 preconditioner in practice on our solver performance, we ran software implementations of the solver without preconditioner and with the Jacobi and ILU0 preconditioners on several benchmark matrices. The matrices were obtained from the SuiteSparse matrix collection [12], [13].
TABLE I
CPU SOLVER TIME IN MS WITH THE LOOSE AND STRICT EXIT CONDITIONS
Precond.: None, None, Jacobi, Jacobi, ILU0, ILU0
Hummocky: 5.9, 925
bcsstk25: 5.8, 3396
wathen120: 8.0, 112
gridgena: 403, 5.9, 643.3, 18.7
qa8fm: 114, 13.1, 79.7, 44.8
Table I shows the results of these tests, which were obtained by running the preconditioned solver single-threaded on an Intel i7-9700 CPU with 16 GB of RAM. Two exit conditions were used: a more precise exit condition requiring the residual norm to be reduced to a small fraction of its initial value, and a less precise exit condition requiring a more modest reduction of the residual norm. Not shown in the table are the numbers of iterations in which the solvers converged, which for all matrices except bcsstk25 were lower when ILU0 preconditioning was used than with the other two preconditioning options. However, a lower iteration count does not directly result in higher solver performance. The first reason for this is the time the pre-processing of the preconditioner takes, which accounts for between 60% and 90% of the time taken by the lower-accuracy solver with the ILU0 preconditioner. The Jacobi preconditioner has barely any pre-processing time, and is powerful enough to be preferred over no preconditioner for all but one matrix, and preferred over ILU0 for all matrices when the solver's precision is low. Only when the solver's precision was set higher did the ILU0 preconditioner's power show: the number of iterations required to reach this accuracy was high enough that the pre-processing time no longer dominated the run time. For most matrices, the ILU0 preconditioner reduced the iteration count enough that the solve performance was better than that of the other preconditioner options tested. The bcsstk25 matrix is an exception, as its solve was only accelerated by the Jacobi preconditioner. We suspect the matrix's narrow banded structure and high values on the diagonal make it particularly well-estimated by the Jacobi preconditioner.

The second reason why a lower iteration count does not directly translate to higher solver performance is not captured in our software tests: different preconditioners might be easier or harder to parallelize, making them more or less attractive to accelerate on hardware. The single vector calculation that constitutes the application of the Jacobi preconditioner is easy to parallelize, and therefore attractive for hardware acceleration. ILU0 application, on the other hand, is not that inherently parallel, but can be made more parallel by using graph coloring.

Algorithm 3 Pseudo-code of the application of the ILU0 preconditioner

Input: LUmat, x
  ▷ LUmat is an N×N sparse matrix in the CSR format
  ▷ x is an N-element array of floating point values
Output: p
  ▷ p is an N-element array of floating point values

for 0 ≤ i < N do
    p[i] = x[i]
    for LURowPointers[i] ≤ v < LURowPointers[i+1] do
        pIndex = LUcolInds[v]
        if pIndex ≥ i then
            break
        end if
        p[i] = p[i] − LUnnzs[v] ∗ p[pIndex]
    end for
end for
for N > i ≥ 0 do
    diagIndex = LURowPointers[i+1]
    for LURowPointers[i+1] > v ≥ LURowPointers[i] do
        pIndex = LUcolInds[v]
        if pIndex ≤ i then
            diagIndex = v
            break
        end if
        p[i] = p[i] − LUnnzs[v] ∗ p[pIndex]
    end for
    p[i] = p[i] / LUnnzs[diagIndex]
end for
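To make the two substitution sweeps of Algorithm 3 concrete, a minimal C++ sketch of the ILU0 application for a scalar (non-blocked) CSR matrix could look as follows; this is our own illustration with hypothetical names, not the hardware or OPM implementation, and it assumes every row stores its diagonal entry:

#include <vector>

// Apply the ILU0 preconditioner p = (LU)^{-1} x, where the LU factors are stored
// in one CSR matrix: L strictly below the diagonal (unit diagonal implied), U on
// and above the diagonal.
std::vector<double> applyILU0(const std::vector<double>& LUnnzs,
                              const std::vector<int>& LUcolInds,
                              const std::vector<int>& LURowPointers,
                              const std::vector<double>& x) {
    const int N = static_cast<int>(LURowPointers.size()) - 1;
    std::vector<double> p(x);  // start from the input vector

    // Forward substitution: traverse rows top-to-bottom, use entries below the diagonal.
    for (int i = 0; i < N; ++i) {
        for (int v = LURowPointers[i]; v < LURowPointers[i + 1]; ++v) {
            int col = LUcolInds[v];
            if (col >= i) break;                 // reached the diagonal
            p[i] -= LUnnzs[v] * p[col];
        }
    }

    // Backward substitution: traverse rows bottom-to-top, use entries above the diagonal,
    // then divide by the diagonal of U.
    for (int i = N - 1; i >= 0; --i) {
        int diagIndex = -1;
        for (int v = LURowPointers[i + 1] - 1; v >= LURowPointers[i]; --v) {
            int col = LUcolInds[v];
            if (col <= i) { diagIndex = v; break; }  // first entry at or below the diagonal
            p[i] -= LUnnzs[v] * p[col];
        }
        p[i] /= LUnnzs[diagIndex];               // divide by the diagonal value
    }
    return p;
}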
D. Matrix Reordering
As covered in section II-C, the ILU0 application function does not inherently have a lot of parallelism for a hardware accelerator to exploit. However, by reordering the matrix, some parallelism in this function can be obtained. The goal of such a reordering is to group rows together that do not depend on one another (i.e., that do not have non-zero values in columns whose index equals the row index of another row in the group). During our work, we have focused on two ways of doing such a reordering: level scheduling and graph coloring.
1) Level Scheduling:
Level scheduling (LS) is a way of reordering a matrix by moving both the rows and the columns of the matrix. That is, the rows are grouped together in colors that can be operated on in parallel, and then the columns are reordered in the same way to keep the matrix consistent with the (re)ordering of the unknowns, which is required during ILU decomposition. LS not only groups independent rows together into 'levels', but also keeps the order of dependencies of the original matrix intact [14]. In other words, if a certain row j that depends on row i comes after row i in the original order, then after level scheduling, row j_ls (the row to which row j was moved during level scheduling) will still come after row i_ls, for all possible values of i and j. A level-scheduling algorithm goes through the matrix and gathers all rows that do not depend on any other rows into the first level. Then it goes through the matrix again and groups all rows that only depend on rows in level 0 into level 1. Then it fills level 2 with rows only dependent on rows from levels 0 and 1, and so on, until every row has been added to a level.
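A minimal C++ sketch of this level assignment is shown below. It is our own illustration, not the OPM implementation, and it computes the levels in a single pass over the lower-triangular dependency pattern rather than with the repeated sweeps described above; for dependencies on earlier rows only, the resulting grouping is the same.

#include <algorithm>
#include <vector>

// Assign each row of a sparse matrix (CSR pattern given by rowPointers and
// colInds) to a level so that a row only depends on rows in earlier levels.
std::vector<int> levelSchedule(const std::vector<int>& rowPointers,
                               const std::vector<int>& colInds) {
    const int N = static_cast<int>(rowPointers.size()) - 1;
    std::vector<int> level(N, 0);
    for (int i = 0; i < N; ++i) {
        int lvl = 0;
        // Row i depends on every earlier row j < i for which entry (i, j) is non-zero,
        // so it must be placed one level after its deepest dependency.
        for (int v = rowPointers[i]; v < rowPointers[i + 1]; ++v) {
            int j = colInds[v];
            if (j < i) lvl = std::max(lvl, level[j] + 1);
        }
        level[i] = lvl;
    }
    return level;
}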
2) Graph Coloring:
Graph Coloring (GC) is a term that we use to describe the reordering of matrices into groups ('colors') of independent rows that does not require maintaining the order of dependencies in the matrix. To be more specific, we use the Jones-Plassmann algorithm [15] to do the graph coloring. This algorithm assigns a random value to every row in the matrix, and then groups into a single color all rows that have a higher value than all not-yet-grouped rows they depend on, repeating this process until all rows have been assigned a color. In this work, graph coloring was chosen over the other method because it is parallelizable and it allows more control over the maximum number of rows and columns in a color, which alleviates some of the inherent hardware design limitations. The disadvantage of graph coloring is that, because of its random nature, it is not guaranteed to find the minimum number of colors, and thus the highest amount of parallelism.
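A sequential C++ sketch of this Jones-Plassmann-style coloring is given below; it is our own simplified illustration (treating the sparsity pattern as an undirected dependency graph), not the parallel implementation used in practice:

#include <random>
#include <vector>

// Color the rows of a sparse matrix so that rows in the same color do not depend
// on each other. adj[i] lists the rows connected to row i in the dependency graph.
std::vector<int> jonesPlassmannColoring(const std::vector<std::vector<int>>& adj) {
    const int N = static_cast<int>(adj.size());
    std::vector<double> weight(N);
    std::mt19937 gen(42);
    std::uniform_real_distribution<double> dist(0.0, 1.0);
    for (int i = 0; i < N; ++i) weight[i] = dist(gen);   // random row weights

    std::vector<int> color(N, -1);
    int currentColor = 0;
    int uncolored = N;
    while (uncolored > 0) {
        std::vector<int> selected;
        for (int i = 0; i < N; ++i) {
            if (color[i] != -1) continue;
            // Row i joins the current color if its weight is higher than the weight
            // of every not-yet-colored neighbour.
            bool localMax = true;
            for (int j : adj[i]) {
                if (color[j] == -1 && weight[j] > weight[i]) { localMax = false; break; }
            }
            if (localMax) selected.push_back(i);
        }
        for (int i : selected) { color[i] = currentColor; --uncolored; }
        ++currentColor;
    }
    return color;
}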
3) Reordering Scheme Selection:
Selecting one of the two matrix reordering methods depends on how important it is to obtain a solution that is as close as possible to the one obtained without reordering. If this is very important, then level scheduling is the preferred reordering scheme. However, when it is less important, because an iterative solver is already estimating the result instead of solving the system precisely, graph coloring can be used, as long as the result is close enough to the actual solution.
E. Sparstitioning
To increase the scalability of our design and allow it to work on larger matrices, we use sparstitioning [16]. Sparstitioning is a method for splitting both the matrix and the vector of a matrix-vector operation into partitions, so that no unnecessary data is read while the operation runs on each partition. However, we make two changes to the sparstitioning algorithm. Firstly, we modify how the partitions are chosen to conform with the chosen matrix reordering scheme. Concretely, every color/level becomes one partition. Secondly, we modify how the vector partitions are stored. When sparstitioning is used for an SpMV, the vector that is operated on is stored in a partitioned format, which introduces some duplication of the vector data. However, storing the vector in this way is not convenient, because the non-partitioned vector is needed for the vector operations of the solver. Therefore, instead of storing the vector partitions themselves, our design stores an array that denotes, for every partition, which indices of the vector are part of that partition. Consequently, only those values are read from the vector when computation starts on that partition.

Because of the matrix reordering and the sparstitioning optimization required to design an HPC iterative solver, we modify the standard CSR matrix storage format in two ways. Firstly, the number of non-zeroes, rows, and vector partition indices used in each partition of the matrix needs to be known. These will be collectively referred to as the sizes of each color. Secondly, the vector partition indices themselves are needed, so that the solver knows which values of the multiplicand vector are accessed by each color. To account for these CSR modifications, we introduce our own format, CSRO, which we formally introduce in section IV-B.

III. RELATED WORKS
A. Sparse Matrix-Vector Multiplication on FPGAs
As described in section II-A, in this paper we focus on Conjugate Gradient (CG)-type solvers, and in particular on the more general BiCGStab, which, unlike CG, is not restricted to symmetric matrices. Previous research into accelerating (parts of) CG solvers using FPGAs has until now focused mostly on accelerating the SpMV kernel, for one good reason: the SpMV function has a wide range of uses in a broad variety of domains and, because of its irregular memory accesses, it is not efficiently executed by classical von Neumann architectures like CPUs. This makes SpMV a perfect candidate to be accelerated using customizable dataflow architectures that can fully exploit application-specific configurable memory hierarchies. Furthermore, in the context of linear algebra iterative solvers, SpMV takes up most of the time. Consequently, a plethora of previous work has addressed the topic of accelerating SpMV on FPGAs; we only list a few of the performant designs without trying to be exhaustive.
• The SpMV unit proposed by Zhuo and Prasanna [17] achieves a performance of between 350 MFLOPS and 2.3 GFLOPS, depending on the number of processing units, the input matrix, the frequency, and the bandwidth. However, to realize their design, the authors use zero-padding and column-wise blocking of the matrix. These design choices make the SpMV less efficient due to storage overhead that becomes critical for large matrices, and very inefficient to use in combination with an ILU0 preconditioner.
• Fowers et al. [18] implemented an SpMV unit that uses a novel sparse matrix encoding designed with the fact that the matrix will be divided over multiple processing units in mind, and a banked vector buffer that allows many simultaneous memory accesses to the input vector. That implementation reaches a performance of up to 3.9 GFLOPS, but does not actually outperform GPU or CPU implementations for large matrices.
• Dorrance et al. [19] obtain a performance of up to 19.2 GFLOPS for their CSC-based SpMV accelerator, which is, as far as we could find, the highest performance of any FPGA-based implementation of SpMV. This high performance is thanks to the high computational efficiency of the implementation, paired with its 64 processing units and high bandwidth (36 GB/s). However, as can be seen in Algorithm 3, in the application of the ILU0 preconditioner the result of one row is needed to calculate the results of rows after it. So, unlike the SpMV, the ILU0 application cannot be done as easily with CSC as it can with CSR. In fact, CSC is unsuitable for use in conjunction with ILU0, and as a result we do not use that storage format in this paper.

Please note that the scope of the paper is not only to design the fastest SpMV unit possible, but also to integrate it in a larger and more complex application, namely a preconditioned linear solver. Consequently, although some of the design choices might be sub-optimal for the fastest SpMV in isolation (e.g., using a CSR-like format instead of the more parallelizable CSC format), we focused on the complete integration and looked for the best SpMV design options from a system-level view, and not for SpMV in isolation as most previous work considers. Nevertheless, the SpMV designed and implemented in this work is on par with the best SpMV solution we could find in the literature when we consider normalized resources and bandwidth as comparison parameters. Our SpMV is able to achieve up to 4.5 and 8.5 GFLOPS in double and single precision, respectively.
B. Sparse Matrix CG Solvers on FPGAs
Research into designing a complete CG solver on an FPGA has been done as well. However, probably due to the complexity of iterative solver algorithms coupled with the limited area, small on-chip memory, and lack of HBM on previous-generation FPGAs, we could find only a handful of related works implementing solvers on reconfigurable hardware.
• Morris et al. [20] implement a CG solver on a dual-FPGA system using both an HLL-to-HDL compiler and custom floating-point units written in VHDL. Despite the fact that using the HLL-to-HDL compiler is probably less optimized than writing VHDL directly, they obtain a speed-up over software of 30%, showcasing the potential of FPGAs in accelerating sparse CG solvers.
• Wu et al. [21] implemented a CG solver on an FPGA consisting of an efficient SpMV unit, a vector update unit, and a diagonal preconditioner. They obtain a speedup of between 4 and 9 times over their software implementation.
• Chow et al. [22] introduce a novel way of statically scheduling accesses to the on-chip memory in a CG solver on an FPGA. Despite the initial cost of performing this scheduling, an average speedup of 4.4 over a CPU solver, and of 3.6 over a GPU one, is achieved.

However, different from the work in this paper, all of this previous research focused on small-scale matrices that are not sufficient for a real-world HPC application setting. Furthermore, another thing that all of these implementations share is that no preconditioner, or only a very simple one, has been used. Preconditioners can improve the performance of a solver significantly, as they reduce the number of iterations that the solver needs to run until it reaches its desired precision. Despite costing additional time in each iteration, the ILU(0) preconditioner used in this work improves the performance of all kinds of CG solvers over cases without preconditioners or with the diagonal preconditioner [23]. Consequently, not using a preconditioner in an iterative solver for a real-world HPC application that requires very large and sparse matrices is not an efficient approach. Finally, using the CG solver limits the applicability of the design to problems that use symmetric matrices to describe the physical system.

IV. SPMV MULTIPLICATION
A. Design Objective
The main objective of our SpMV kernel on the FPGA is to ensure that the unit has a high computational efficiency, which is required to fully exploit the HBM available on state-of-the-art FPGA generations such as the Xilinx Alveo U280 [24] data center accelerator card. The remainder of this section highlights the design choices of the SpMV implementation for the HPC domain when using the latest HBM-enhanced FPGA accelerators.
B. The Matrix Format
As described in section III-A, we have chosen to base our kernel on the row-major CSR matrix storage format. Furthermore, CSR is also the default format used in Flow, a real-world HPC software application that we will use in section VII to validate our results and benchmark the performance. However, this format poses a challenge for any HPC FPGA-based design, namely because CSR contains one integer per matrix row to describe how many values are in each row. Since our design objective is to maximize the throughput of the SpMV unit, we want to feed as many non-zero values and column indices into the SpMV unit as it can sustain on every clock cycle. Unfortunately, when using CSR's rowPointers it is not straightforward to determine to which row each value belongs. The number of rows that a given number of non-zero values in a matrix belong to can vary wildly based on the matrix's sparsity pattern. Consequently, the number of rowPointers that need to be read every cycle also varies, and also depends on the values of the rowPointers of the cycle before it. For this reason, we introduce a novel sparse matrix storage format based on CSR to be used by our accelerator. Similar to CSR, this new format contains the matrix's non-zero values and their column indices as two arrays, but instead of the rowPointers it contains a new row offset for every non-zero value in the matrix. Because of the added offset array, this new format is tentatively called the Compressed Sparse Row-Offsets (CSRO) format. A value in the newRowOffset array is 0 if the non-zero value at that same index is in the same row as the non-zero value before it. If a non-zero value is not in the same row as the value before it, then the rowOffset at the same index is equal to 1 plus the number of empty rows between that non-zero value's row and its predecessor's row. To help illustrate how a matrix is stored in the CSRO format, figure 2 shows a matrix as it would be represented in the CSRO format. For comparison, figure 1 shows how that same matrix would be represented in the CSR format. Note that the newRowOffset value of the first value in the CSRO format is 1. This is a convention adopted to show that the first row starts with the first value.
Fig. 1. An Example Matrix Stored in the CSR Format.
Fig. 2. An Example Matrix Stored in the Novel CSRO Format.
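The conversion from CSR to CSRO is straightforward; a minimal C++ sketch of the encoding described above (our own illustration, not the OPM pre-processing code) is:

#include <vector>

// Given the rowPointers array of a CSR matrix with nnz non-zero values, build
// the newRowOffsets array of the CSRO format: 0 means "same row as the previous
// non-zero", k >= 1 means "starts a new row, skipping k-1 empty rows".
std::vector<int> csrToCsroOffsets(const std::vector<int>& rowPointers, int nnz) {
    const int N = static_cast<int>(rowPointers.size()) - 1;
    std::vector<int> newRowOffsets(nnz, 0);
    int prevRow = -1;                         // row of the previous non-zero value
    for (int row = 0; row < N; ++row) {
        int begin = rowPointers[row];
        int end = rowPointers[row + 1];
        if (begin == end) continue;           // empty row, accounted for by the offset
        newRowOffsets[begin] = row - prevRow; // 1 for the first value of the next row, more if rows were empty
        prevRow = row;
        // all remaining non-zeroes of this row keep offset 0
    }
    return newRowOffsets;
}

The non-zero value and column index arrays stay exactly as in CSR; only the rowPointers array is replaced by this per-value offset stream, which is what allows the accelerator to consume a fixed number of values per cycle.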
1) The Blocked CSR:
The OPM platform and its Flow simulator do not use regular CSR, but rather blocked CSR, in which each non-zero value in the matrix is not one value, but a 3x3 matrix of values. This is done to group together all relevant physical information about a single point of the reservoir that the matrix models, but that is not relevant for this paper. What is relevant is that the nine values in the 3x3 blocks are not easily used on an FPGA. Not only do these blocks often contain zeroes (about half of all values in the blocks of the Norne test case are zeroes), but the odd number of values in a block poses a problem when reading them over the ports of the FPGA or storing them in its internal memories, all of which are built to hold an even number of values. Since the new matrix format that we introduce and the sparstitioning already require extensive reordering of the non-zero values of the matrix, we take this opportunity to also remove the blocking from the matrices in those steps. As a result, the FPGA solver kernel primarily works on non-blocked matrices. The only exception to this is the division at the end of the backward substitution step in the ILU0 application. Here, a division by the value on the diagonal needs to be done, and this value on the diagonal is a block. These diagonal blocks are the only blocks that could not be removed, and as a result the hardware kernel still uses those blocks as inputs.
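As an illustration of this unblocking step, the following C++ sketch expands a 3x3 blocked CSR matrix into a scalar CSR matrix while dropping the zeroes inside the blocks. It is a simplified, hypothetical version of the pre-processing described above and assumes the block values are stored row-major within each block.

#include <vector>

struct CsrMatrix {
    std::vector<double> vals;
    std::vector<int> colInds;
    std::vector<int> rowPointers;  // size = numRows + 1
};

// Expand a blocked CSR matrix with bs x bs blocks into a scalar CSR matrix,
// skipping zero entries inside the blocks.
CsrMatrix unblockBcsr(const std::vector<double>& blockVals,
                      const std::vector<int>& blockColInds,
                      const std::vector<int>& blockRowPointers, int bs) {
    const int numBlockRows = static_cast<int>(blockRowPointers.size()) - 1;
    CsrMatrix out;
    out.rowPointers.push_back(0);
    for (int br = 0; br < numBlockRows; ++br) {
        for (int r = 0; r < bs; ++r) {                      // scalar row inside the block row
            for (int b = blockRowPointers[br]; b < blockRowPointers[br + 1]; ++b) {
                for (int c = 0; c < bs; ++c) {               // scalar column inside the block
                    double v = blockVals[b * bs * bs + r * bs + c];
                    if (v != 0.0) {
                        out.vals.push_back(v);
                        out.colInds.push_back(blockColInds[b] * bs + c);
                    }
                }
            }
            out.rowPointers.push_back(static_cast<int>(out.vals.size()));
        }
    }
    return out;
}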
C. Design Overview

1) The SpMV Pipeline:
The upper limit of the throughput of our SpMV unit design is determined by the number of multipliers in the unit. Thus, to achieve a high computational efficiency, our design aims to stream values into those multipliers every cycle. Figure 3 shows how data is streamed into the multipliers: non-zero values and column indices are read from the memories that hold them, and the column indices are used as addresses at which to read elements from the input vector, while the non-zero values are delayed by the latency of the vector memory. In this way, the non-zero matrix values and the corresponding vector elements arrive at the multiplier at the same time. There is one memory that holds the vector partition for every two multiplier units, because the BRAM blocks of an FPGA have at most two read ports. Which values of the vector are part of a certain partition is determined by the vector partition indices of that partition. Those values are calculated by the software during pre-processing. Each vector partition memory contains the entire vector partition, so the data is replicated and stored in each one. While not the most space-efficient solution, this does allow a high number of simultaneous memory accesses at different addresses, which is what the SpMV unit requires.
Fig. 3. A High-Level Schematic of the SpMV Unit.
After the multiplications have finished, the SpMV unit needs to add together the multiplication results that belong to the same matrix row to obtain the SpMV result of that row. For this, we designed a selective adder tree (SAT) and a reduce unit, which receive instructions about which values to add together from a control unit. This control unit in turn obtains this information from the new row offset data. The selective adder tree adds together, within a single cycle, all the values that belong to the same row, while the reduce unit adds together the SAT results from different clock cycles that belong to the same row. SAT results that already represent the final result of a row (i.e., if all the values of that row arrived in the same clock cycle) do not need to go into the reduce unit. Finally, a merge unit merges all of the results of the adder tree and reduce unit into a set number of output ports that are written to a result memory.
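Functionally, the SAT, reduce, and merge units together implement the following row accumulation, shown here as a plain sequential C++ model driven by the newRowOffsets stream; this only models the arithmetic, not the pipelined hardware structure:

#include <vector>

// Accumulate per-row sums from a stream of products (non-zero value * vector
// element) using the CSRO newRowOffsets stream: an offset > 0 closes the previous
// row (skipping offset-1 empty rows), an offset of 0 continues the current row.
std::vector<double> accumulateRows(const std::vector<double>& products,
                                   const std::vector<int>& newRowOffsets,
                                   int numRows) {
    std::vector<double> y(numRows, 0.0);
    int row = -1;
    for (std::size_t i = 0; i < products.size(); ++i) {
        if (newRowOffsets[i] > 0) row += newRowOffsets[i];  // start of a new row
        y[row] += products[i];                              // same row: keep accumulating
    }
    return y;
}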
2) SpMV Top Level:
While the SpMV pipeline is the most important part of the SpMV unit, as it does the calculations, some additional units are needed to keep the pipeline running correctly. These units are as follows:
• The external read unit handles all reads from outside of the FPGA chip. At the start of the SpMV run it reads the sizes of all colors, and uses these to read all matrix data that is needed during the SpMV.
• For the SpMV unit to be able to use sparstitioning with vector partition indices, the vector values at those indices need to be readable quickly. To achieve this, our FPGA design contains a single large URAM memory that stores the multiplicand vector once. Due to the size currently chosen for this memory in the design, up to 262144 values can be stored, meaning that a matrix may not have more than 262144 columns to be multiplied by the kernel. The internal read unit coordinates reads from this memory. It receives the vector partition indices of the current color from the external read unit, reads the vector values at those indices, and sends those values into the vector partition memories of the SpMV pipeline.
• The results of the SpMV pipeline can be out of order, as the time some results spend in the reduce and merge units differs from others. The write unit stores the results of the SpMV pipeline in a memory block that has been cyclically partitioned, so that it can write multiple results into it and at the same time still read from it. It also keeps track of the index up to which all results have been received. Whenever it has received enough values to fill up a cache line, it queues up that line to be sent to the HBM memory.

An overview of the SpMV top-level unit is shown in figure 4. Not shown in this figure is the control logic around it, which instructs the units how much data to read or write and to/from where. To do this, the control logic keeps track of which color it is currently operating on and how many still remain.

Fig. 4. Schematic of the SpMV Top-Level Unit.
3) Non-partitioned SpMV:
One of the tasks the SpMV top-level unit must perform is the reading of the vector partition data that the sparstitioning adds to the matrix data. For a large part, this read operation can be done as a look-ahead, meaning that while the SpMV pipeline operates on one color, the vector partition data of the next color can already be read from the URAM. This data is then stored in a BRAM block in the internal read unit. However, after the SpMV pipeline completes, this vector partition data needs to be transferred to the vector partition memories of the SpMV pipeline, and the vector partition indices for the next color need to be read from the off-chip memory. This is overhead added by our choice to use sparstitioning. To determine how much this overhead slows down the overall performance, we added the option to not use sparstitioning in the design. This version of the design has as a downside a more limited maximum matrix size that it can operate on, as the complete multiplicand vector needs to be stored multiple times in the vector memories of the SpMV pipeline. The advantage is that the additional steps of reading vector partition indices and transferring vector partition data no longer need to occur.

To visualize the difference in operation between the sparstitioned and the non-sparstitioned SpMV unit, figure 5 shows the steps the sparstitioned SpMV unit takes, while figure 6 shows those of the non-sparstitioned SpMV unit. Tasks placed horizontally next to one another in these figures are performed at the same time. Obviously, the flow in figure 6 is much simpler. Also note how in figure 5 the reading of the next vector partition (and the reading of the indices to read from) is already done while the SpMV unit is still working on another partition. This hides the time these reads would normally take, but also necessitates an additional memory to read the next vector partition into, and a step in which the data is transferred from this memory into the SpMV unit.

Fig. 5. Overview of the Tasks in the Sparstitioned SpMV Unit.
Fig. 6. Overview of the Tasks in the Non-Sparstitioned SpMV Unit.
D. GPU SpMV kernels
For the GPU, kernels in CUDA and OpenCL are tested. Only doubles are used for the non-zero values. NVIDIA's cuSPARSE [25] features a blocked SpMV function called cusparseDbsrmv(). For OpenCL, algorithm 1 from [26] is used. In this algorithm, a warp of 32 threads is assigned to a single block row, and threads are assigned to elements in the block row. To cover the entire block row, a warp iterates, handling floor(32/bs) blocks at a time, where bs is the block size. A warp only operates on complete blocks, so some threads might always be idle. For 3x3 blocks, a warp can cover 3 blocks with 27 threads and has 5 inactive threads. During initialization, a thread calculates the index of the block row, how many blocks the row has, and which block it is supposed to operate on. It also calculates its position within a block (row and column). Since a warp only covers complete blocks, a thread's assigned position in a block never changes. While iterating, it multiplies the target element from A with the value from x, adding the product to a running total. Once the whole block row is complete, threads with the same vertical position (row) in the blocks reduce the values in their local registers and write the reduced output to global memory. The reduction can be done with shuffling or via shared memory.

V. SPARSE MATRIX SOLVER DESIGN
To expand the SpMV kernel into a kernel that can perform preconditioned BiCGStab, units to perform both the preconditioning and the vector operations need to be added. In this section, the design of those units is discussed, followed by a description of the complete design that links all units and performs the complete preconditioned solve.
A. Applying the ILU0 Preconditioner
As shown in algorithms 2 and 3, the basic structures of the SpMV and of the forward and backward substitutions that the ILU0 application performs are similar. This, combined with the fact that the SpMV and the ILU0 application are never active at the same time, since the result of one is needed to start the other, makes re-using the hardware of the SpMV unit for the ILU0 apply a desired feature to save resources. Unlike the SpMV, which goes through the matrix it operates on line by line, the ILU0 application goes through its LU matrix in a non-sequential order. First, it goes through all of the values below the diagonal top-to-bottom, and then it goes through all values above and including the diagonal bottom-to-top (last row first). This access pattern skips the second half of the values of each row during the forward substitution, since it only uses the lower-triangular matrix there; then it goes back up through the matrix only using the second half of the values, since it only needs the upper-triangular matrix during the backward substitution. Consequently, it needs to read all LU matrix data twice, ignoring about half of it each time. Which data it needs to use and which it does not depends on the current row number and the column index. This memory access pattern is not specifically inefficient for the FPGA, but inefficient in general, so in our design the lower-triangular matrix L, the upper-triangular matrix U, and the diagonal values are split from the LU matrix and used as three separate data structures. First, L is read to perform the forward substitution, and then U and the diagonal values are read to perform the backward substitution. Since the latter step happens bottom-to-top, both the U matrix and the diagonal values are stored in reverse order.

The main difference between the substitutions and the SpMV is that the substitutions modify the vector that they are operating on, while the SpMV does not. All rows that are grouped into the same color by graph coloring can be in the pipeline at the same time, but between the colors, the results of the previous color need to be integrated into the vector that is currently being operated on. To achieve this, the results of the ILU0 application are not only written to the URAM vector memory, but also forwarded directly into the vector partition memories so that they can be used during the next color, without requiring the unit to wait with reading its vector partition until the previous color finishes its computation.

There are additional operations that the forward and backward substitutions need to apply compared with the SpMV. For the forward substitution, this is a subtraction of the SpMV unit result from the p vector, and for the backward substitution, it is this subtraction followed by a division by the diagonal value. The ILU0 unit is added to the SpMV unit to perform these operations. The reading of the p vector can be done directly from the URAM, since that is where the p vector is stored during the ILU0 application. The diagonal vector is read from off-chip memory by the external read unit. Figure 7 shows how the original SpMV top-level was modified to allow the unit to also apply the ILU0 preconditioning step. Note that this unit is not yet the top-level of the complete solver.
Fig. 7. Schematic of the SpMV/ILU0 unit.
B. The Vector Operations
Over the course of a BiCGStab run, a number of vector operations of three different kinds need to be applied:
• A dot product, which calculates the inner product of two vectors.
• An axpy, which performs $\alpha\vec{x} + \vec{y}$.
• A norm, which calculates the absolute value (length) of a vector (i.e., $\sqrt{\vec{x} \cdot \vec{x}}$).

Of these, the dot product and the norm are very similar in structure, as a norm is the square root of the inner product of a vector with itself. All three operations are highly parallel, and perform the same basic operations (one multiplication and one addition) for every element of their input vectors. For this reason, and to provide more flexibility in exploiting the task-level parallelism in the solver, we designed a unit that can perform either an axpy or a dot product. This dot_axpy unit contains an equal number of multiply and add floating-point units, which can be connected in different ways, depending on which function needs to be performed. When an axpy needs to be performed, the adders and multipliers are connected in parallel pipelines that each perform the axpy on one element of the input vectors at a time. For this function, an additional input port is used for the scaling constant $\alpha$, besides the two input vector ports and one output vector port. Figure 8 gives an overview of how the adders and multipliers are arranged in the dot_axpy unit when it performs an axpy operation.

Fig. 8. Schematic Overview of the dot_axpy Unit in axpy Mode.

For the dot products, all multipliers work in parallel, and their results are added together into a single value by a tree of adders. The result of this adder tree is added to the results of previous cycles by the final adder, which adds each new input to its most recent output. After the adder tree has processed all values of the input vectors, logic around the final adder continuously feeds two subsequent valid outputs of that adder back into it, until one single final result is left. Figure 9 gives an overview of how the adders and multipliers are arranged in the dot_axpy unit when it performs a dot product.
Fig. 9. Schematic overview of dot axpy unit in dot mode.
During the vector operations, a degree of task-level parallelism exists in the BiCGStab solver algorithm. This parallelism can be found in the following operations (using the line numbers of Algorithm 1):
• The vector subtraction in line 1 can be pipelined with the SpMV operation in that same line, so that the subtraction of each element of $\vec{b}$ happens as soon as the SpMV result in the same position as that element is done. The dot product in line 5 can be pipelined with the subtraction of line 1 in the same way.
• The two axpy operations in line 9 can be pipelined.
• The dot product in line 12 can be pipelined with the SpMV in line 11.
• The axpy operations in lines 14 and 15 can be done in parallel, and the dot product in line 16 can be pipelined with the axpy in line 15.
• The two dot products in line 19 can be done in parallel, and can both be pipelined with the SpMV in line 18.
• The axpy operations in lines 20 and 21 can be done in parallel. The dot product in line 22 can be pipelined with the axpy in line 21, and so can the dot product in line 7 of the following iteration.

Together, these instances of possible task-level parallelism cover all vector operations that the BiCGStab performs. The number of vector operations that can be performed in parallel in each of these instances is, in order: 2, 2, 1, 3, 2, and 4. The complete vector unit contains multiple dot_axpy units to exploit this task-level parallelism, but to avoid the dot_axpy units being idle too often, and to not increase the number of read/write ports too much (see section V-C), we choose to implement 2 dot_axpy units in the vector_ops unit.
C. Read and Write Ports
The Xilinx Alveo U280 data center accelerator card has two different types of large memory connected to the FPGA chip: an off-chip DDR memory that consists of two 16 GB DDR4 memory banks, each of which is connected to the FPGA via a single read/write port, and an 8 GB HBM memory stack that is connected to the FPGA via 32 read/write ports. Each memory port is accessed at the kernel level either via a 512-bit wide AXI4 interface (for the DDR memory) or via a 256-bit wide AXI3 interface (for the HBM memory). This memory architecture has led to the following design decisions:
• Reading and writing delays from/to the HBM are slightly shorter than those from/to the DDR memory, so all vectors (which all need to be read and written multiple times during the solver run) are stored in the HBM memory.
• To reduce the number of memories to which the host CPU needs to transfer its data, and to thereby reduce the initial data transfer time, all matrix data and the initial b vector are stored in the DDR memory.
• Some vector operations read from and write to the same vector, for example the axpy in line 14 of Algorithm 1. To avoid conflicts and to make data buffering easier, each AXI port is used either to read or to write, hence the x, r, and p vectors are stored in two memory locations connected to different ports. During an operation that updates one of those vectors, it is read from the port that it was written to most recently, and written to the other port. This essentially uses the ping-pong buffering technique to stream data efficiently.
• All HBM ports are located on one of the two narrow sides of the (rectangular) FPGA chip, where the HBM crossbar and memory stack are located, so, to avoid routing congestion, we choose to limit the number of HBM ports. We chose to use six HBM ports, because this number of read and write ports can be used efficiently during the vector operations.
• Since the width of the read ports is 512 bits, we design our computation pipelines to accept one full cache line every cycle. For the HBM ports, which are natively 256 bits wide, we let the implementation tools handle the AXI port width conversion. That means each dot_axpy unit has 8 double-precision multipliers and adders.
D. The Complete FPGA Solver
A top-level solver unit was designed to encompass both the SpMV/ILU0 unit and the vector ops unit. This top-level unit instructs those units which tasks to perform and with which input/output vectors. It also regulates the access of both units to the URAM internal vector memory and handles the filling of this memory during the initialization step of the solver. Figure 10 gives a schematic overview of the top-level unit of the BiCGStab solver kernel. To not overly complicate the figure, the signals between the units and the memory are not pictured. Besides the units already covered in the previous sections, the figure also shows a block of floating-point operators and a memory for floating-point variables. The operators apply the operations that only need to be performed on a single value at a time, such as the square root of the norm calculation or the multiplications and division of the β calculation in line 8 of Algorithm 1. The memory is used to store the results of these operations, and contains α, β, ω, ρ, ρ_new and conv_threshold; the scalar recurrences these operators implement are sketched below.

Fig. 10. Schematic Overview of the Complete Solver Design.
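For reference, the scalar updates follow the standard preconditioned BiCGStab recurrences (see, e.g., Saad [14]); the snippet below shows them in plain C++. The correspondence with the line numbers of Algorithm 1 is assumed here, since the algorithm listing is not reproduced in this section.

```cpp
#include <cmath>

// Standard BiCGStab scalar recurrences; the dot products are produced by the
// vector unit and are passed in as plain doubles.
struct BicgstabScalars {
    double rho = 1.0, rho_new = 1.0, alpha = 1.0, omega = 1.0, beta = 0.0;

    // beta = (rho_new / rho) * (alpha / omega)  -- assumed to be line 8 of Algorithm 1
    void update_beta()                        { beta  = (rho_new / rho) * (alpha / omega); }
    // alpha = rho_new / (r0_hat, v)
    void update_alpha(double r0hat_dot_v)     { alpha = rho_new / r0hat_dot_v; }
    // omega = (t, s) / (t, t)
    void update_omega(double t_s, double t_t) { omega = t_s / t_t; }

    // Convergence test on the residual norm: ||r|| = sqrt((r, r)) < conv_threshold.
    bool converged(double r_dot_r, double conv_threshold) const {
        return std::sqrt(r_dot_r) < conv_threshold;
    }
};
```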
E. The GPU Solvers
For completeness of the evaluation and comparison of our work, we also developed GPU solvers, first using CUDA and NVIDIA's cuSPARSE library [25]. cuSPARSE has functions for vector operations like dot, axpy and norm, as well as a blocked ILU0 preconditioner. Second, we also implemented the solver using OpenCL [27], with a structure similar to the CUDA version, defining kernels for the vector operations and for applying the ILU0 preconditioner. One big difference is that the construction of the ILU0 preconditioner (the decomposition) is done on the CPU; this has not been implemented on the GPU yet, whereas the CUDA version performs the ILU0 decomposition on the GPU through cuSPARSE.
VI. PERFORMANCE MODELING
To predict the performance of the hardware design while it was in development, as well as to gauge how well the design scales, we constructed a performance model that we run in Matlab. This model emulates the data flowing through the computation units and calculates how many clock cycles the execution would take. This section first describes which parts of the design are modeled and which are not, and then presents the predicted modelling results.
A. Model Details
The performance model is composed of functions that model the individual hardware kernels: it has a function to estimate the performance of the SpMV, one for the ILU0, and one for the vector operations. To make the performance estimations as accurate as possible, the hardware models work on the same data that is sent to the kernel, including the partitioning data. The exception is that, since the model does not actually perform the solving calculation, it does not read the non-zero matrix values or any vector data. Nevertheless, the partitioning and sparsity pattern information are needed to calculate the performance. Furthermore, the functions that calculate the performance of the ILU0 apply and SpMV operations share the same structure, in the same way that those operations share the same module in the hardware kernel. These functions do the following for every color:
• First, the function calculates the time to transfer the vector partition data into the SpMV unit. It does this by dividing the number of values in the partition by the number of ports allocated to the internal URAM memory from which the partition is read.
• If apply ILU0 is performed, the function calculates the time the transfer of the P vector into the ILU0 unit memory would take in the same manner.
• Then, it calculates the time the SpMV pipeline takes. During this step, the function's emphasis is on the time the reading of the matrix data takes. It assumes a set bandwidth for each of the three ports from which the three matrix arrays (non-zero values, column indices and new row offsets) are read, and simulates reading into three FIFOs at the speed set by that bandwidth. Whenever enough data is available in the FIFOs for all three arrays to fill an input line of the SpMV pipeline, that data is removed from the FIFOs.
• After enough data has been read from all three ports for the current color, and the simulated FIFOs are empty, a set delay is added for the SpMV pipeline to write back the result data. Since much less data needs to be written back than needed to be read, no delay beyond the regular write overhead of the final lines is deemed necessary.
• If apply ILU0 is performed, a set delay for the ILU0 unit is added, and the write delay is calculated with the on-chip URAM instead of an off-chip memory.
The cycle count results of all colors are added together to find the total number of clock cycles that the SpMV or apply ILU0 takes. The calculations done to estimate the cycle count of the vector operations are comparatively simpler, but follow the same method: first the reading of data into FIFOs at a set bandwidth is modeled, data is removed from the FIFO whenever enough is in it to fill up a line of inputs of the vector operation unit, and finally, when all reading is done, the latency of the pipeline of the operation is added, as well as some writing overhead cycles for the axpy operation. A top-level function adds together the cycle count results of apply ILU0, SpMV, and vector operations to arrive at a final estimate. This top-level function has a number of variables that can be changed to do design space exploration and to assess the scalability of the kernel. These are:
• The number of multipliers and adders in both the vector operation and SpMV/ILU0 units.
• The bandwidth to the off-chip memory in GB/s.
• The bandwidth to the on-chip URAM in number of ports (how many values of the vector stored on-chip can be read simultaneously).
• The delay of the adder and multiplier units.
• The matrix that is modeled, and how long the solve of that matrix takes, in iterations.
Please note that the largest source of performance uncertainty in the design is the time the memory operations take. Memory accesses do not happen at a constant pace, but in bursts, the size of which and the time between them may vary based on various factors related to the memory. As a result, the weakest part of the model is that it does not accurately capture this memory behaviour. Similarly, it does not support modeling the reading of two arrays of data from the same port, nor the performance penalties from reading and writing on the same memory bank at the same time. A simplified sketch of the per-color cycle estimate described above is given below.
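The C++ sketch below approximates the per-color cycle estimate described above. It is a simplified reconstruction rather than the actual Matlab model; the parameter names, the assumed bytes per non-zero, and the fixed latencies are illustrative.

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>

struct ModelParams {
    double clock_hz         = 280e6;  // kernel frequency
    double ext_bw_gbs       = 50.0;   // external (matrix) read bandwidth, GB/s
    int    uram_ports       = 2;      // vector values readable from URAM per cycle
    int    spmv_width       = 8;      // non-zeros accepted per cycle (one 512-bit line)
    int    pipeline_latency = 64;     // assumed fill/drain latency, in cycles
    int    write_overhead   = 16;     // assumed write-back overhead, in cycles
};

// Cycle estimate for one color of the SpMV (or ILU0 apply): vector partition
// transfer, matrix streaming at a fixed bandwidth, then fixed end-of-color delays.
uint64_t spmv_color_cycles(uint64_t partition_vals, uint64_t nnz, const ModelParams& p) {
    // 1) Stream the vector partition into the on-chip URAM memory.
    const uint64_t vec_cycles = (partition_vals + p.uram_ports - 1) / p.uram_ports;

    // 2) Stream the three matrix arrays (values, column indices, row offsets);
    //    roughly 11 bytes per non-zero is assumed here.
    const double bytes_per_cycle = p.ext_bw_gbs * 1e9 / p.clock_hz;
    const uint64_t read_cycles =
        static_cast<uint64_t>(std::ceil(nnz * 11.0 / bytes_per_cycle));

    // 3) The SpMV pipeline consumes at most spmv_width non-zeros per cycle, so it
    //    only becomes the bottleneck when memory can feed it faster than that.
    const uint64_t compute_cycles = (nnz + p.spmv_width - 1) / p.spmv_width;

    return vec_cycles + std::max(read_cycles, compute_cycles)
                      + p.pipeline_latency + p.write_overhead;
}
```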
B. Modelling Results
We use the model to analyze the scalability of the design and to understand what the critical points are in achieving the best performance. We perform a design space exploration by varying the model parameters, starting with the number of adders and multipliers set to 2, increasing the bandwidth from 10 GB/s, and increasing the number of internal ports from 2. When a parameter no longer gives a performance increase, we switch to increasing the new bottleneck parameter, as sketched below. We perform this exercise by increasing all parameters up to a maximum value that we select as an upper bound of what is practically achievable for our design. Please note that even though the maximum theoretical bandwidth is 460 GB/s, due to the complexity of the design, which influences the mapping and routing of resources from/to the memory, we restrict the bandwidth parameter to 100 GB/s. Figure 11 shows the modelling results.
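The greedy sweep can be expressed as follows; the bounds, the knob granularity, and the stand-in cost function are illustrative assumptions, with the real numbers coming from the Matlab model.

```cpp
#include <algorithm>
#include <cstdint>

struct DseParams { int pus = 2; double ext_bw_gbs = 10.0; int uram_ports = 2; };

// Stand-in for the top-level model function (illustrative only): the estimate is
// the maximum of compute-bound, memory-bound, and vector-transfer-bound cycles.
uint64_t estimate_solver_cycles(const DseParams& p) {
    const double nnz = 2.8e6, bytes = nnz * 11.0, clock_hz = 280e6;
    const uint64_t compute = static_cast<uint64_t>(nnz / p.pus);
    const uint64_t memory  = static_cast<uint64_t>(bytes / (p.ext_bw_gbs * 1e9 / clock_hz));
    const uint64_t vectors = static_cast<uint64_t>(4.0e5 / p.uram_ports);
    return std::max({compute, memory, vectors});
}

// Greedy exploration: grow one parameter at a time until it stops helping,
// then move to the next bottleneck, within the practical upper bounds above.
DseParams explore() {
    DseParams best;
    uint64_t best_cycles = estimate_solver_cycles(best);
    for (bool improved = true; improved; ) {
        improved = false;
        for (int knob = 0; knob < 3; ++knob) {
            DseParams trial = best;
            if      (knob == 0 && trial.pus < 16)           trial.pus *= 2;
            else if (knob == 1 && trial.ext_bw_gbs < 100.0) trial.ext_bw_gbs += 10.0;
            else if (knob == 2 && trial.uram_ports < 16)    trial.uram_ports *= 2;
            else continue;
            const uint64_t c = estimate_solver_cycles(trial);
            if (c < best_cycles) { best = trial; best_cycles = c; improved = true; }
        }
    }
    return best;
}
```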
VII. EXPERIMENTAL RESULTS
Evaluation of the performance of our design is done in three stages. First, the performance of the stand-alone SpMV unit is evaluated, followed by that of the complete stand-alone solver. Finally, the FPGA solver is integrated into the flow simulator, and its performance is compared to that of a CPU and a GPU. All the FPGA bitstream implementations were done in Xilinx Vitis 2019.2 and run on a Xilinx Alveo U280 data center accelerator card [24]. Table II lists the platforms we used to benchmark the kernels and the flow simulator; each platform is referred to in the rest of the paper by its short name (e.g., FPGA). In this section, the performance of the SpMV unit is reported in Giga FLoating-point OPerations per second (GFLOP/s). These are calculated by dividing the total number of FP operations performed by the time it took to perform them, where the total number of required operations is two times the number of non-zero values in the matrix that is being multiplied, since one multiplication and one addition need to be performed for each non-zero value.
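Concretely, the metric reduces to the small helper below; the function name is ours.

```cpp
#include <cstdint>

// GFLOP/s of an SpMV: two floating-point operations (one multiply and one add)
// per non-zero value, divided by the measured execution time in seconds.
double spmv_gflops(std::uint64_t nnz, double seconds) {
    return 2.0 * static_cast<double>(nnz) / seconds / 1e9;
}
```

For example, the 3.59 GFLOP/s reported later for the double p kernel on the NORNE matrix (2,818,879 non-zeros, Table VI) corresponds to roughly 1.6 ms per SpMV.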
A. SpMV results
The SpMV unit has many configurable parameters that impact both performance and FPGA resource usage. For this performance evaluation, we choose to compare between single-precision (float) and double-precision (double) implementations, as well as between partitioned (p) and non-partitioned (np) implementations. Other parameters were set to the values in Table III.

The resource utilization of the four chosen SpMV kernel implementations is shown in Table IV. From this table, we can observe that the BRAM utilization sets the limit on how many times this kernel could be replicated on the same FPGA device. All implementations are able to reach a frequency of 300 MHz, which is the maximum frequency at which the AXI memory interfaces can run.

The sizes and types of the matrices that were used to test the SpMV unit are listed in Table V. All of the listed matrices were obtained from the SuiteSparse matrix collection [12], [13], apart from the NORNE matrix, which models a real-life oil field and is a testcase in the OPM framework. As such, this matrix was not obtained from the SuiteSparse matrix collection, but instead by exporting it from the flow simulator. Please note that all but one of the matrices are square, and of the square matrices, three of them are non-symmetric. These matrices were chosen to prove that the SpMV unit works on such non-symmetric and non-square matrices. Some of these matrices are not used as benchmarks for the BiCGStab solver, however, as it does not work for non-square matrices and has additional limitations, as explained in the next section.

In Figure 12, the performance for all the matrices used to benchmark the kernel is shown. The corresponding tabulated values can be found in Table VI. For the FPGA and GPU, those performance numbers do not include the data transfer time between the host CPU and the on-board memories. From this figure, we can see that the single-precision versions of the FPGA kernel achieve almost 2 times more FLOP/s than their double-precision counterparts. This can be explained by the fact that the amount of data that needs to be read for each single-precision operation is half of the amount of data that needs to be read for a double-precision one. We also observe, as expected, that the non-partitioned version achieves a slightly higher performance than the corresponding partitioned version, at the trade-off that the non-partitioned version cannot operate on the largest of the benchmark matrices.

Fig. 11. Domain Search Exploration for the ILU0-BiCGStab Solver. One internal port has a maximum bandwidth of around 18 GB/s at the current kernel frequency of 280 MHz. The configuration achieved in this paper for the FPGA solver was 2 internal ports, 8 PUs, and 50 GB/s external bandwidth. This corresponds to the point in the top middle figure on the line estimating a 1.6x speedup, which was confirmed by the experimental results presented next.
TABLE II
CHARACTERISTICS OF THE PLATFORMS USED TO BENCHMARK THE KERNELS

Short name | Host CPU | Host memory | Accelerator | Accelerator memory
FPGA | Xeon E5-2640V3 (16) 2.6 GHz | DDR4-2133 (4 / 59 GB/s) | Xilinx Alveo U280 | DDR4+HBM (498 GB/s)
GPU | Xeon E5-2640V3 (16) 2.6 GHz | DDR4-2133 (4 / 59 GB/s) | Nvidia K40m | GDDR5 (288 GB/s)
GPU | Xeon E5-2698V4 (40) 2.2 GHz | DDR4-2400 (4 / 77 GB/s) | Nvidia V100-SXM2-16GB | HBM2 (900 GB/s)
CPU | Xeon E5-2698V4 (40) 2.2 GHz | DDR4-2400 (4 / 77 GB/s) | – | –

TABLE III
SPMV PARAMETERS OF DIFFERENT SPMV IMPLEMENTATIONS

Parameter | double p | double np | float p | float np
Num mults | 8 | 8 | 16 | 16
Max rows | 262144 | 65536 | 262144 | 65536
Read ports | 4 | 4 | 4 | 4
TABLE IV
FPGA RESOURCE UTILIZATION OF DIFFERENT IMPLEMENTATIONS OF THE SPMV KERNEL AS % OF THE RESOURCES AVAILABLE IN THE FPGA'S DYNAMIC REGION

Resource | double p | double np | float p | float np
LUT | 4.08 % | 4.12 % | 4.92 % | 5.00 %
LUTRAM | 0.70 % | 0.69 % | 0.82 % | 0.78 %
REG | 4.14 % | 4.45 % | 4.96 % | 5.24 %
BRAM | 15.27 % | 12.38 % | 18.97 % | 12.38 %
URAM | 4.17 % | 6.67 % | 2.50 % | 6.67 %
DSP | 0.93 % | 0.93 % | 0.74 % | 0.74 %

Fig. 12. SpMV performance. Please note that for the FPGA, the maximum aggregated HBM memory bandwidth is 460 GB/s; however, due to design limitations caused by implementation issues, the maximum bandwidth used by the kernel is around 50 GB/s.

TABLE V
CHARACTERISTICS OF THE MATRICES USED TO BENCHMARK THE SPMV KERNEL

Name | Rows | Columns | NNZs | Density % | Symm.
nos1 | 237 | 237 | 2115 | 1.811 | yes
epb1 | 14734 | 14734 | 95053 | 0.044 | no
Hummocky | 12380 | 12380 | 121340 | 0.079 | yes
bodyy6 | 19366 | 19366 | 134748 | 0.036 | yes
lp_ken_18 | 105127 | 154699 | 358171 | 0.002 | no
gridgena | 48962 | 48962 | 512084 | 0.021 | yes
wathen120 | 36441 | 36441 | 565761 | 0.043 | yes
finan512 | 74752 | 74752 | 596992 | 0.011 | yes
qa8fm | 66127 | 66127 | 1660579 | 0.038 | yes
crystm03 | 24696 | 24696 | 1751310 | 0.096 | yes
vanbody | 47072 | 47072 | 2336898 | 0.105 | yes
NORNE | 133293 | 133293 | 2818879 | 0.016 | no
cfd2 | 123440 | 123440 | 3087898 | 0.020 | yes

B. Stand-alone FPGA solver results

For the ILU0-BiCGStab solver, we are only interested in a version of the solver that can run on larger matrices with high precision, so only the double-precision, partitioned version of the solver is benchmarked. The design parameters listed for the double p SpMV implementation in Table III hold for this solver implementation as well, except for the number of read ports, which is increased by one.

In Table VII, the resource utilization of the ILU0-BiCGStab solver is compared to that of the SpMV kernel with the same parameters. In both cases, these values refer to the percentage of resources used over those available in the dynamic region (which is reconfigurable by the user), and they include all the supporting circuitry to control the solver and access the memory on the FPGA board. Hence, not included in these values are the resources used by the FPGA's static region, which is provided by Xilinx. Due to the increased complexity of the solver over the SpMV kernel, it could only reach a frequency of 280 MHz.

Some of the matrices used to test the SpMV unit were not usable during the testing of the solver, because they were non-square, had zero values on their diagonals, or exceeded the on-board memories of the design. Table VIII lists the matrices that were used to test the BiCGStab solver, along with their original sizes and the sizes of the derived L/U matrices used by the FPGA solver.

The performance results obtained after running the stand-alone ILU0-BiCGStab solver are shown in Figure 13. For each matrix, we present the best result obtained with either the graph-coloring or the level-scheduling reordering algorithm. The detailed timings are listed in Table IX, which contains the run time as reported by the profiling info (Solver) of the runs (only for the FPGA, used in Figure 13), the run time as measured by the host (Host, used in the figure for the CPU and GPU results), and the time spent transferring (Transfer) the input data and the output results between the host CPU's and the accelerators' on-board DRAM/HBM memories. The number of iterations (Iter) needed to solve the system is also reported. The Host time is, in all cases, higher than the Solver time, as the former includes the overhead of starting and waiting for the end of the solver kernel. The transfer times are not included in the other two times.

In Table X we report the bandwidth utilization of each memory port as recorded by the profiling facilities present in the FPGA solver. Each column shows a memory port, along with its mapping (either to DDR4 or HBM memory). The bandwidth utilization is given as a percentage of the maximum bandwidth available for each individual port, which is 18.8 GB/s for ports Read 0 and Read 1 (DDR4 memory, 512-bit AXI bus @ 300 MHz) and 14.1 GB/s for the remaining ports (HBM memory, 256-bit AXI bus @ 450 MHz).
C. Flow results
The ILU0-BiCGStab solvers were integrated into the flow simulator in OPM [10] by replacing the call to the simulator's own iterative solver with calls to a function that performs the required pre-processing, sends the resulting data to the accelerator, and then reads the results back from it once it is done; a minimal sketch of this integration point is shown below. The source code used was version 2020.10-rc4, with custom modifications to integrate our solvers. The original flow solver is an ILU0-BiCGStab solver based on the DUNE project [28].
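The integration follows the pattern sketched here; the class and function names are illustrative assumptions and do not correspond to the actual OPM interfaces.

```cpp
#include <vector>

// Illustrative sketch of the integration point: the simulator's linear solve is
// redirected to a wrapper that reorders/partitions the system, ships it to the
// accelerator, runs the ILU0-BiCGStab kernel, and copies the solution back.
struct BlockedMatrix { /* BCSR data: values, column indices, row offsets */ };

class AcceleratorBackend {
public:
    // Work that depends only on the sparsity pattern (reordering, partitioning);
    // done once and reused for every subsequent solve of the simulation.
    virtual void analyze(const BlockedMatrix& A) = 0;
    // Per-solve work: update values, transfer to the device, solve, read back x.
    virtual int  solve(const BlockedMatrix& A, const std::vector<double>& b,
                       std::vector<double>& x, double tolerance) = 0;
    virtual ~AcceleratorBackend() = default;
};

// Called where the simulator would otherwise invoke its DUNE-based solver.
int solveOnAccelerator(AcceleratorBackend& backend, const BlockedMatrix& A,
                       const std::vector<double>& b, std::vector<double>& x,
                       double tolerance, bool firstCall) {
    if (firstCall) {
        backend.analyze(A);                   // sparsity-dependent setup
    }
    return backend.solve(A, b, x, tolerance); // returns the iteration count
}
```

Because the sparsity pattern of the system does not change during a simulation (as discussed below), the analysis step can be hoisted out of the per-solve path.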
In Table XI, the performance of this software solver and of our solvers is compared when running a simulation of the NORNE testcase (file NORNE_ATW2013.DATA) [29] that is part of the OPM project. For all runs, the parameter '--matrix-add-well-contributions=true' was added to the command line, to have a fair comparison with the FPGA results, because the FPGA solver currently does not support separated well contributions. For the Norne case, including the well contributions increases the run time on the CPU by approximately twenty percent compared to the default configuration.

TABLE VI
PERFORMANCE IN GFLOP/S FOR THE SPMV KERNEL

Matrix | FPGA double p | FPGA double np | FPGA float p | FPGA float np | GPU cusparse | GPU OpenCL LS | GPU OpenCL GC | CPU (double)
nos1 | 0.34 | 0.34 | 0.34 | 0.34 | 0.20 | 0.08 | 0.08 | 0.00
epb1 | 2.50 | 3.17 | 3.57 | 5.76 | 14.62 | – | – | 0.15
Hummocky | 2.79 | 3.64 | 4.05 | 6.11 | 15.17 | 5.92 | 6.22 | 0.29
bodyy6 | 2.52 | 3.36 | 3.67 | 6.78 | 17.97 | 8.17 | 8.69 | 0.22
lp_ken_18 | 1.25 | – | 1.40 | – | – | – | – | –
gridgena | 3.21 | 3.81 | 5.33 | 7.64 | 34.14 | 22.76 | 22.76 | 0.63
wathen120 | 3.51 | 3.82 | 6.24 | 7.67 | 40.41 | 23.09 | 23.09 | 0.62
crystm03 | 3.69 | 4.34 | 6.69 | 8.29 | 33.36 | 16.44 | 16.22 | 0.88
finan512 | 2.89 | – | 3.95 | – | 21.71 | 17.56 | 15.92 | 1.41
qa8fm | 3.85 | – | 6.90 | – | 55.35 | 43.13 | 43.70 | 2.18
vanbody | 4.15 | 4.60 | 7.84 | 8.69 | – | – | – | –
NORNE | 3.59 | – | 5.90 | – | 90.93 | 74.18 | 77.23 | 1.81
cfd2 | 3.80 | – | 6.89 | – | 57.18 | 43.80 | 43.49 | 6.04

Fig. 13. Solver execution time for selected matrices. For readability of the CPU/FPGA results, some GPU results have been clipped; refer to Table IX for the actual values. Please note that for the FPGA, the maximum aggregated HBM memory bandwidth is 460 GB/s and the maximum aggregated DDR4 memory bandwidth is around 37 GB/s; however, due to design limitations caused by implementation issues, the maximum bandwidth used by the kernel is around 52 GB/s.

TABLE VII
FPGA RESOURCE UTILIZATION OF THE SOLVER COMPARED TO THE SPMV KERNEL AS % OF THE RESOURCES AVAILABLE IN THE FPGA'S DYNAMIC REGION

Resource | SpMV | Solver
LUT | 4.08 % | 6.93 %
LUTRAM | 0.70 % | 1.13 %
REG | 4.14 % | 6.41 %
BRAM | 15.27 % | 24.64 %
URAM | 4.17 % | 4.17 %
DSP | 0.93 % | 3.18 %

TABLE VIII
CHARACTERISTICS OF THE TESTING MATRICES

Name | Dim. | Non-zeroes (FPGA matrix) | Non-zeroes (FPGA L-matrix) | Non-zeroes (FPGA U-matrix) | Non-zeroes (GPU matrix)
nos1 | 237 | 1017 | 390 | 390 | 1017
bodyy6 | 19366 | 134197 | 121414 | 70762 | 134197
gridgena | 48962 | 513060 | 421068 | 315336 | 513060
crystm03 | 24696 | 583770 | 279519 | 279522 | 583770
wathen120 | 36441 | 565761 | 468529 | 379446 | 565761
qa8fm | 66127 | 1660564 | 1431908 | 1251238 | 1660564
NORNE | 133293 | 1314999 | 726369 | 513512 | 2818879
TABLE IX
TIMING RESULTS (IN MILLISECONDS) FOR SOLVING SELECTED MATRICES

Matrix | FPGA: Solver, Host, Transfer, Iter | CPU DUNE | GPU cusparse | OpenCL LS | OpenCL GC

TABLE X
READ AND WRITE PORTS BANDWIDTH UTILIZATION OF THE FPGA SOLVER, AS % OF THE AVAILABLE BANDWIDTH
Matrix | Read 0 (DDR4) | Read 1 (DDR4) | Read 2 (HBM) | Read 3 (HBM) | Read 4 (HBM) | Write 0 (HBM) | Write 1 (HBM) | Write 2 (HBM) | Aggregated read bandwidth (GB/s) | % of max aggregated read bandwidth
nos1 | 42.8 | 14.9 | 50.6 | 50.5 | 49.5 | 79.4 | 79.4 | 79.1 | 32.0 | 40.2
bodyy6 | 82.6 | 36.8 | 70.8 | 64.4 | 72.4 | 100 | 100 | 100 | 51.6 | 64.7
gridgena | 81.7 | 36.2 | 70.0 | 65.8 | 70.2 | 100 | 100 | 100 | 51.1 | 64.1
crystm03 | 79.8 | 43.2 | 68.3 | 69.2 | 69.3 | 100 | 100 | 100 | 52.2 | 65.4
wathen120 | 80.6 | 35.6 | 68.9 | 67.0 | 70.0 | 100 | 100 | 100 | 50.7 | 63.7
qa8fm | 81.4 | 36.0 | 68.6 | 66.5 | 72.1 | 100 | 100 | 100 | 51.1 | 64.2
NORNE | 83.0 | 35.8 | 66.8 | 62.5 | 69.6 | 100 | 100 | 100 | 50.3 | 63.1

Moreover, for the FPGA run, the parameter '--threads-per-process=8' was also added. This only impacts the assembly part of the simulation, not the preconditioner nor the linear solver, in effect reducing the computational time spent outside the preconditioning and linear solve parts. The reordering algorithm was left at its default for the FPGA (level scheduling), because graph coloring produces, in this case, more colors than the FPGA solver supports. For the executions using the FPGA solver, all the application's threads were pinned to the CPU. A reduction of 1e-2 was used for the results shown in Table XI and Figure 13, while a reduction of 1e-6 was used for the results shown in Table XII. For an explanation of the characteristics of each platform used to run flow, please refer to Table II.

To gain a better understanding of which parts of running the FPGA solver cause it to be slower than the software version, we timed the different functions applied by this solver during a flow run; the results are shown in Table XIII. During the running of the flow simulator, the sparsity pattern of the matrix that needs to be solved does not change. This makes it possible for functions that depend only on the sparsity pattern, like finding the reordering and partitioning pattern, to be done only once during initialization. From Table XIII, we observe that the majority of the solver time is actually spent running the FPGA solver, but a significant portion of the run time is also taken up by the creation of the preconditioner and the data transfers between the host and the FPGA.
TABLE XI
COMPARISON OF PERFORMANCE FOR FLOW WITH REDUCTION=1E-2 (DEFAULT) USING CPU, FPGA AND GPU (IN SECONDS)

Function | CPU | FPGA | GPU cusparse | GPU OpenCL LS | GPU OpenCL GC | GPU cusparse | GPU OpenCL LS | GPU OpenCL GC
Total time | 540.6 | 751.8 | 489.7 | 799.6 | 749.3 | 320.1 | 507.6 | 458.5
Solver time | 540.6 | 751.7 | 489.6 | 799.5 | 749.2 | 320.0 | 507.6 | 458.5
Assembly time | 99.4 | 94.6 | 133.7 | 137.5 | 163.6 | 115.6 | 112.7 | 114.7
Well assembly time | 43.0 | 47.1 | 21.15 | 21.6 | 26.5 | 13.9 | 14.6 | 16.8
Linear solve time | 339.5 | 559.9 | 248.9 | 551.4 | 460.6 | 117.6 | 293.6 | 207.8
Linear setup time | 41.8 | 45.3 | 38.1 | 39.5 | 47.7 | 33.6 | 35.4 | 43.6
Update time | 57.3 | 54.4 | 67.3 | 70.1 | 84.5 | 53.6 | 56.9 | 71.2
Output write time | 3.5 | 3.5 | 3.2 | 3.2 | 3.2 | 2.9 | 2.9 | 3.0
Number of linear solves | 1458 | 1441 | 1437 | 1507 | 1818 | 1449 | 1507 | 1861
Number of linear iterations | 21991 | 21012 | 21282 | 22626 | 42160 | 21020 | 22626 | 40991
Speedup total time | 1.00 | 0.72 | 1.10 | 0.68 | 0.72 | 1.69 | 1.07 | 1.18
Speedup linear solver time | 1.00 | 0.61 | 1.36 | 0.62 | 0.74 | 2.89 | 1.16 | 1.63

(In Tables XI-XIII, the two GPU column groups correspond to the two GPU platforms listed in Table II.)

TABLE XII
COMPARISON OF PERFORMANCE FOR FLOW WITH REDUCTION=1E-6 USING CPU, FPGA AND GPU (IN SECONDS)

Function | CPU | FPGA | GPU cusparse | GPU OpenCL LS | GPU OpenCL GC | GPU cusparse | GPU OpenCL LS | GPU OpenCL GC
Total time | 1254.1 | 1512.9 | 900.0 | 1893.0 | 1333.3 | 484.5 | 950.5 | 579.4
Solver time | 1254.0 | 1512.9 | 900.0 | 1893.0 | 1333.2 | 484.5 | 950.5 | 579.3
Assembly time | 82.2 | 84.9 | 112.0 | 114.7 | 115.9 | 101.8 | 101.7 | 101.7
Well assembly time | 36.1 | 42.3 | 17.1 | 17.5 | 17.6 | 11.9 | 12.1 | 12.5
Linear solve time | 1081.3 | 1333.8 | 691.4 | 1680.1 | 1119.5 | 303.8 | 768.8 | 397.0
Linear setup time | 35.2 | 38.9 | 31.1 | 31.4 | 31.3 | 27.8 | 28.2 | 28.7
Update time | 46.1 | 50.9 | 56.9 | 57.4 | 57.3 | 45.6 | 45.6 | 45.8
Output write time | 3.5 | 3.5 | 3.0 | 3.2 | 3.2 | 2.8 | 2.9 | 2.9
Number of linear solves | 1207 | 1207 | 1202 | 1207 | 1207 | 1207 | 1207 | 1207
Number of linear iterations | 79163 | 78275 | 78447 | 82673 | 117463 | 78958 | 82673 | 117294
Speedup total time | 1.00 | 0.83 | 1.39 | 0.66 | 0.94 | 2.59 | 1.32 | 2.16
Speedup linear solver time | 1.00 | 0.81 | 1.56 | 0.64 | 0.97 | 3.56 | 1.41 | 2.72

TABLE XIII
BREAKDOWN OF LINEAR SOLVE TIME (IN SECONDS) SPENT RUNNING THE SOLVERS IN FLOW WITH REDUCTION=1E-2 (DEFAULT)

Device | Step (accumulated time) | CPU | FPGA | GPU cusparse | GPU OpenCL LS | GPU OpenCL GC | GPU cusparse | GPU OpenCL LS | GPU OpenCL GC
Host | Reorder vectors | – | 1.10 | – | 1.08 | 1.14 | – | 1.14 | 1.30
Host | Create preconditioner | 38.5 | 128.75 | 22.27 | 60.09 | 95.35 | 5.76 | 58.00 | 81.00
Host | BILU0 reorder matrix | – | – | – | 20.47 | 27.72 | – | 21.30 | 23.60
Host | BILU0 decomposition | – | – | – | 31.42 | 57.34 | – | 30.20 | 49.60
Host | BILU0 copy to GPU | – | – | – | 5.04 | 5.97 | – | 3.79 | 4.63
Host | Memory setup | – | 10.61 | – | – | – | – | – | –
Host | CPU to accelerator transfer | – | 65.95 | 4.70 | 5.10 | 6.18 | 3.83 | 4.46 | 5.59
Accelerator | Total solve | 274.1 | 304.90 | 175.10 | 448.50 | 324.10 | 71.80 | 197.30 | 88.10
Accelerator | ILU apply | 112.6 | – | 134.50 | 395.10 | 68.00 | 63.70 | 182.30 | 62.30
Accelerator | SpMV | 127.3 | – | 28.40 | 37.30 | 25.50 | 2.60 | 4.40 | 7.70
Accelerator | Rest | 28.9 | – | 8.50 | 12.70 | 223.80 | 4.00 | 8.30 | 14.20
Host | Accelerator to CPU transfer | – | 0.76 | 0.27 | 0.35 | 0.16 | 0.26 | 0.34 | 0.16

VIII. CONCLUSION

In this paper, we have evaluated the potential of using FPGAs in HPC, which is a highly relevant topic because of the rapid advances in reconfigurable hardware. To perform this study, we began by proposing a novel CSR-based encoding to optimize a new SpMV kernel on FPGA, which can be easily integrated with an ILU0 preconditioner. We subsequently developed a hardware model to predict and guide the design of the ILU0-preconditioned BiCGStab solver, which helped us to understand the trade-offs between area and performance when scaling the resources. Next, we implemented the ILU0-BiCGStab preconditioned solver targeted at an HBM-enabled FPGA using a mixed programming model of HLS and RTL to maximize the performance of the double-precision kernel. To validate our work, we integrated the complete solver in the OPM reservoir simulator, both for FPGA and GPU. Finally, we provided extensive evaluation results for both the standalone SpMV and solver kernels, as well as for the complete reservoir simulation execution on a real-world use case running on three different platforms: CPU, FPGA, and GPU.

We find that the FPGA is on par with the CPU execution and 3x slower than the GPU implementation when comparing only the kernel executions. Although the obtained results show that an FPGA is not yet competitive with the GPU, we believe that with better on-chip support for double precision, an increased number of computation units and memory ports, increased internal parallelism, as well as other optimizations on the host software side (i.e., reduction of preconditioner computation time, increased efficiency of the host-FPGA data exchange and task execution control), the FPGA can become a better alternative to speed up scientific computing. This has been illustrated by scaling our hardware performance model beyond what it was possible to achieve in this work due to the encountered hardware limitations. In the future, we will work to eliminate these restrictions and provide a more efficient implementation. An alternative research direction could focus on developing other preconditioned solvers that might be more suitable for an FPGA implementation.

To encourage research in this direction, the source codes for the RTL kernels and the integration with OPM Flow are made available in the OPM git repository [10]. A pull request was created against the most current version of the code at the time of writing (December 2020); because of this, the performance results obtained with that code may not be directly comparable with the results reported in this paper, which were gathered with a previous release of OPM's Flow, as stated in Section VII-C.
REFERENCES

[1] R. Nane, V. Sima, C. Pilato, J. Choi, B. Fort, A. Canis, Y. T. Chen, H. Hsiao, S. Brown, F. Ferrandi, J. Anderson, and K. Bertels, "A survey and evaluation of FPGA high-level synthesis tools," IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems.
[8] H. A. van der Vorst, "Bi-CGSTAB: A fast and smoothly converging variant of Bi-CG for the solution of nonsymmetric linear systems," SIAM Journal on Scientific and Statistical Computing, vol. 13, no. 2, pp. 631–644, 1992. [Online]. Available: https://doi.org/10.1137/0913035
[9] E. Chow and A. Patel, "Fine-grained parallel incomplete LU factorization," SIAM Journal on Scientific Computing, vol. 37, no. 2, pp. C169–C193, 2015. [Online]. Available: https://doi.org/10.1137/140968896
[10] "Open Porous Media reservoir simulator." [Online]. Available: https://github.com/OPM
[11] A. Monakov, A. Lokhmotov, and A. Avetisyan, "Automatically tuning sparse matrix-vector multiplication for GPU architectures," in High Performance Embedded Architectures and Compilers, Y. N. Patt, P. Foglia, E. Duesterwald, P. Faraboschi, and X. Martorell, Eds. Berlin, Heidelberg: Springer Berlin Heidelberg, 2010, pp. 111–125.
[12] "SuiteSparse matrix collection." [Online]. Available: https://sparse.tamu.edu
[13] T. A. Davis and Y. Hu, "The University of Florida Sparse Matrix Collection," ACM Transactions on Mathematical Software, vol. 38, no. 1, 2011.
[14] Y. Saad, Iterative Methods for Sparse Linear Systems. Society for Industrial and Applied Mathematics, 2003.
[15] M. T. Jones and P. E. Plassmann, "A parallel graph coloring heuristic," SIAM Journal on Scientific Computing, vol. 14, no. 3, pp. 654–669, 1993.
[16] B. Sigurbergsson, T. Hogervorst, T. D. Qiu, and R. Nane, "Sparstition: A partitioning scheme for large-scale sparse matrix vector multiplication on FPGA," International Conference on Application-specific Systems, Architectures and Processors (ASAP), vol. 30, pp. 51–58, 2019.
[17] L. Zhuo and V. K. Prasanna, "Sparse matrix-vector multiplication on FPGAs," in Proceedings of the 2005 ACM/SIGDA 13th International Symposium on Field-Programmable Gate Arrays. ACM, 2005, pp. 63–74.
[18] J. Fowers, K. Ovtcharov, K. Strauss, E. S. Chung, and G. Stitt, "A high memory bandwidth FPGA accelerator for sparse matrix-vector multiplication." IEEE, 2014, pp. 36–43.
[19] R. Dorrance, F. Ren, and D. Marković, "A scalable sparse matrix-vector multiplication kernel for energy-efficient sparse-BLAS on FPGAs," in Proceedings of the 2014 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays. ACM, 2014, pp. 161–170.
[20] G. R. Morris, V. K. Prasanna, and R. D. Anderson, "A hybrid approach for mapping conjugate gradient onto an FPGA-augmented reconfigurable supercomputer." IEEE, 2006, pp. 3–12.
[21] G. Wu, X. Xie, Y. Dou, and M. Wang, "High-performance architecture for the conjugate gradient solver on FPGAs," IEEE Transactions on Circuits and Systems II: Express Briefs, vol. 60, no. 11, pp. 791–795, 2013.
[22] G. C. Chow, P. Grigoras, P. Burovskiy, and W. Luk, "An efficient sparse conjugate gradient solver using a Beneš permutation network." IEEE, 2014, pp. 1–7.
[23] E. Topsakal, R. Kindt, K. Sertel, and J. Volakis, "Evaluation of the BiCGSTAB(l) algorithm for the finite-element/boundary-integral method," IEEE Antennas and Propagation Magazine, 2016, pp. 663–672.
[27] A. Munshi, B. Gaster, T. G. Mattson, J. Fung, and D. Ginsburg, OpenCL Programming Guide. Addison-Wesley, 2011.