Fast Parallel Newton-Raphson Power Flow Solver for Large Number of System Calculations with CPU and GPU
Zhenqi Wang a,b,*, Sebastian Wende-von Berg a,b, Martin Braun a,b
a Department of Energy Management and Power System Operation, University of Kassel, Wilhelmshöher Allee 73, 34121 Kassel, Germany
b Fraunhofer Institute for Energy Economics and Energy System Technology, Königstor 59, 34119 Kassel, Germany
* Corresponding author: Zhenqi Wang ([email protected])
Abstract
To analyze large sets of grid states, e.g. when evaluating the impact of the uncertainties of renewable generation with probabilistic Monte-Carlo simulation or in stationary time-series simulation, a large number of power flow calculations has to be performed. For applications in real-time grid operation, grid planning and further cases in which computational time is critical, a novel approach to the simultaneous parallelization of many Newton-Raphson power flow calculations on CPU and with GPU-acceleration is proposed. The results show a speed-up of more than 100x compared to the open-source tool pandapower, on both CPU and GPU, when performing repetitive power flows of systems with admittance matrices of the same sparsity pattern. The speed-up relies on algorithmic improvements and a highly optimized parallelization strategy, which reduce repetitive work and saturate the high computational capability of modern CPUs and GPUs well. This is achieved with the proposed batched sparse matrix operations and a batched linear solver based on LU refactorization. The batched linear solver shows a large performance improvement compared to the state-of-the-art linear system solver KLU and better saturation of the GPU at small problem scales. Finally, the method of integrating the proposed solver into pandapower is presented, so that the parallel power flow solver with its outstanding performance can easily be applied in challenging real-life grid operation and in innovative research, e.g. data-driven machine learning studies.
Keywords:
Probabilistic Power Flow, Monte-Carlo Simulation, Contingency Analysis, GPU-acceleration, Newton-Raphson, Parallel Computing
1. Introduction
The penetration of Distributed Energy Resources (DER), e.g. wind and PV, causes high uncertainties in the planning and operation of power systems. For both grid planning and operation, with the probability distribution of the infeed and load with uncertainties known, the problem can be formulated as a Probabilistic Power Flow (PPF) problem and evaluated with Monte-Carlo Simulation (MCS)-Power Flow (PF). The computational effort to solve the PPF problem varies with the complexity of the unknown variables and the sampling method. In the past decades, research has succeeded in reducing the required number of PFs [1, 2, 3]. Besides, MCS-PF finds good application in similar evaluations of uncertainty with historical or synthetic injection and load profiles.
Static N-1 contingency analysis is required for the real-time and future grid states in grid operation, with which
grid congestion caused by renewable energy needs to be properly handled with market and operational measures [4]. Thus, it is essential to evaluate a large number of PFs fast.
Furthermore, recent power system research trends show an increasing demand for a powerful PF solver for many PFs. Data-driven machine learning methods show great potential in power system state estimation and the approximation of PF, and Artificial Neural Networks (ANN) can assist decision making for dispatch actions in grid operation. These approaches require a large number of PFs of similar grid topology to be performed in the training phase [4, 5, 6]. The proposed method finds successful application in a study using ANNs for Medium-Voltage (MV) power system state estimation [7] and in training ANNs with numerical differentiation to perform economic dispatch in grid operation [8].
With open-source tools such as matpower [9] and pandapower [10], a grid can be easily modelled and a single PF can be conveniently solved, but the performance of these tools in solving many PFs with similar grid topology is unsatisfactory. Emerging works have shown great potential
in the application of massively parallel hardware, e.g. GPUs, in accelerating single PFs [11, 12, 13, 14] as well as many PFs, e.g. for static contingency analysis and online PPF [15, 16, 17, 18, 19, 20].
Motivated by the aforementioned challenging use cases, a general-purpose parallel PF solver for solving many PFs on grids with the same sparsity pattern of the admittance matrix is presented in this work. PF with the classical Newton-Raphson (NR) method is efficiently solved on both CPU and GPU with the proposed approach. Recent developments [16, 21, 22] show the advantage of parallelization for solving many PFs through batched operations on GPU. Our work distinguishes itself by further studying the possible bottlenecks in the NR algorithm. Through further optimization, repetitive work can be reduced and the utilization of the available computational resources can be maximized.
The main contributions of our work are the following. First, to optimize the performance on CPU, besides the task-level parallelization, a new parallelization scheme with additional explicit Single Instruction Multiple Data (SIMD) parallelization [23] is proposed, which, to the best of our knowledge and as revealed in our literature overview, is the first of its kind. The multi-threaded SIMD LU refactorization shows a further speed-up compared to the state-of-the-art KLU [24] library with task-level parallelization. Secondly, an easy-to-implement row-level parallelization strategy is proposed for batched sparse matrix operations on GPU. Thirdly, an improved GPU LU refactorization based on the latest advances [25, 21] is proposed, which increases the hardware saturation in the final stages of the LU refactorization process and thus improves the performance at small batch sizes. Furthermore, the forward-substitution backward-substitution step is optimized with a fine-grained parallelization strategy. Last but not least, the method of integrating the parallel PF solver into a Python-based open-source power system analysis tool is presented, which is essential for the application in real-life grid planning, operation and research.
This paper is organized in five sections. Section 2 introduces the CPU and GPU architectures and how the performance of computational tasks can be optimized on each. Section 3 introduces the proposed implementation of the parallel PF solver. Section 4 introduces the proposed batched linear solver for the CPU and GPU platforms. Section 5 presents the benchmarking results with a comprehensive analysis.
2. Review of parallelization on CPU and GPU
This section gives an overview of modern CPUs and GPUs and their specialties with regard to solving many PFs efficiently. Fig. 1 shows a generalized CPU-GPU hardware structure. On a modern CPU, multiple physical cores are usually available, with which multiple tasks can be performed independently.
Figure 1: CPU-GPU architecture overview.

For computational tasks, good utilization of the multiple cores greatly improves the performance. Furthermore, with the large on-chip caches available, memory transactions over the limited memory bandwidth can be avoided on cache hits.
The SIMD instruction set is available on each physical core of a CPU; it can execute the same operation on multiple data residing in extended registers with a size of 128 bits or higher [26]. This corresponds to two or more double-precision floating-point operations (FPs) at once, with which the overall FP capability of the CPU is greatly improved [27]. SIMD instruction sets, e.g. SSE and AVX2, are available on most modern CPUs and can carry out 2 and 4 double-precision FPs simultaneously, respectively.
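To illustrate this data-level parallelism, the following minimal sketch (our own illustration, not taken from the solver's code base; names are hypothetical) performs a multiply-add on 4 doubles per AVX2 instruction:

```cpp
#include <immintrin.h>

// Minimal AVX2 sketch: y += a * b on mini-batches of 4 doubles.
// One 256-bit register holds 4 double-precision values, so each
// instruction performs 4 FPs at once. n is assumed a multiple of 4.
void vmad_avx2(const double* a, const double* b, double* y, int n) {
    for (int i = 0; i < n; i += 4) {
        __m256d va = _mm256_loadu_pd(a + i);   // load 4 doubles
        __m256d vb = _mm256_loadu_pd(b + i);
        __m256d vy = _mm256_loadu_pd(y + i);
        vy = _mm256_add_pd(vy, _mm256_mul_pd(va, vb));  // y += a * b
        _mm256_storeu_pd(y + i, vy);           // store 4 doubles
    }
}
```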
Compared to CPUs, modern GPUs are designed not only for graphics rendering but also as powerful, highly parallel programmable processors [28]. A large number of parallel Stream Processors (SP) are integrated on-chip, which results in a high peak FP performance compared to CPUs. The SPs are clustered into Stream Multiprocessors (SM), which contain L1 cache, shared memory and control units [29]. Besides, GPUs have a high memory bandwidth, as shown in Fig. 1, which brings an advantage on memory-bound tasks.
Multi-threaded task-level parallelization can be realized with OpenMP [30], a popular library in scientific computing for parallelization on shared memory. This parallelism paradigm is called Simultaneous Multithreading (SMT) [31]. OpenMP uses compiler directives to mark the parallelized regions of the serial program during the compilation phase. Since OpenMP 4.0, SIMD parallelization is directly supported within the framework [32]. On top of the SMT parallelization, mini-batch tasks can further profit from the SIMD instruction set; this type of parallelism is denoted as SMT+SIMD.
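A minimal sketch of the SMT+SIMD paradigm (our own illustration with hypothetical names): the outer loop distributes mini-batches over threads, while the innermost loop is vectorized so that the same operation is broadcast over a mini-batch of 4 tasks.

```cpp
#include <omp.h>

constexpr int kMiniBatch = 4;  // matches the AVX2 width for doubles

// Apply the same update to every task: task-level (SMT) parallelism
// over mini-batches, SIMD parallelism within each mini-batch.
void scale_batched(double* data, int n_elements, int n_batches, double factor) {
    #pragma omp parallel for
    for (int mb = 0; mb < n_batches; ++mb) {
        double* base = data + mb * n_elements * kMiniBatch;
        for (int e = 0; e < n_elements; ++e) {
            // The same element of all tasks in a mini-batch lies contiguously.
            #pragma omp simd
            for (int t = 0; t < kMiniBatch; ++t)
                base[e * kMiniBatch + t] *= factor;
        }
    }
}
```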
In order to perform large-scale scientific computation tasks on massively parallel hardware like GPUs, the Single
Instruction Multiple Threads (SIMT) programming model is widely adopted. Using NVIDIA CUDA as an example, the user programs functions called CUDA kernels, which define the behavior of massive numbers of simultaneous threads. A simplified 2D programming model of a CUDA kernel is shown in Fig. 2.

Figure 2: Simplified model of CUDA kernel structure.

The massive number of threads is organized in a two-level structure, with each level supporting up to three dimensions. The first level is called the thread block; all threads within a block are executed on a single SM and can be synchronized explicitly with a synchronization barrier [29]. All thread blocks are organized in a grid, whose execution is distributed over all on-chip SMs. Each thread is aware of its index in the two-level structure, which is essential to define the data-index-dependent individual tasks. The sizes of thread block and grid can be configured upon kernel call.
During the execution of a CUDA kernel, 32 consecutive threads within one block are aggregated into the minimum execution unit called a warp [33]. The threads within one warp execute the same instruction. When threads within a warp encounter different instructions, these are executed serially, losing the advantage of parallelism. To fully utilize the computational capability for compute-bound tasks, the first criterion for GPU performance optimization is to avoid thread divergence within one warp.
Optimizing memory operations on the on-device memory is another important factor. Due to the high latency of the graphics memory (multiple hundreds of cycles) [34], memory transaction instructions are queued in a first-in-first-out way. It is possible to hide this latency and saturate the memory bandwidth by keeping the queue busy [33]. Each memory transaction accesses 32 bytes of consecutive memory; to maximize the efficiency of memory transactions and to avoid excessive ones, it is important to enable coalesced memory access within a warp, especially for memory-bound tasks [35]. Constants and textures can be accessed through specific caches, which improves accesses with sparse matrix indexing.
Any task on GPU can only be executed when the data is available in the device memory. Fig. 1 shows the limited bandwidth between CPU (host, H) and GPU (device, D). To optimize this, the required data transactions between host and device should be reduced as far as possible. Furthermore, CUDA offers the possibility of overlapping different operations (H2D, D2H memory transactions and kernel execution) through the usage of CUDA streams [20].
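The following CUDA sketch (our own illustration; names are hypothetical) shows the two-level indexing and a batched data layout in which consecutive threads of a warp touch consecutive addresses, so that accesses coalesce:

```cpp
// Each thread handles one (row, task) pair. Tasks of a batch are laid
// out contiguously per element, so threads with consecutive tId read
// consecutive addresses (coalesced memory access).
__global__ void scale_rows(double* data, int n_rows, int n_tasks, double f) {
    int row = blockIdx.y;                               // second grid dim: row
    int tId = blockIdx.x * blockDim.x + threadIdx.x;    // first dim: task id
    if (row < n_rows && tId < n_tasks)
        data[row * n_tasks + tId] *= f;
}

// Launch configuration: block size a multiple of the warp size (32), e.g.
//   dim3 block(128);
//   dim3 grid((n_tasks + 127) / 128, n_rows);
//   scale_rows<<<grid, block>>>(d_data, n_rows, n_tasks, 2.0);
```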
3. Approach for a parallel NR-PF solver
In this work, an approach for the acceleration of many NR PFs with the same sparsity pattern of the admittance matrix Y_bus (following the matpower convention [9]) is presented. This special property leads to the same sparsity pattern of the Jacobian matrix (JM) updated in the iterative solving process, which brings further advantages for speed-up by reducing repetitive work as follows:
• Reuse of the sparse matrix indexing
• Reuse of static lookups for the sparse matrix update
• Reuse of the memory working space
Fig. 3 shows an example of the indexing data and profile data required by the parallel PF solver. The Y_bus is stored in Compressed Row Storage (CRS) format with an extended diagonal index. The standard CRS sparse matrix indexing consists of RowPtr, with a length of the number of rows plus one, and
ColIx, whose length equals the number of non-zero elements; this format is efficient for iterating over the rows of the matrix. The extra
DiagPtr stores the data index of the diagonal element of each row, for the convenience of iterating and aggregating over the diagonal elements (required in the JM update). Fig. 3b gives the aforementioned sparse indexing of the non-zero elements of the Y_bus in Fig. 3a. The variable profile data required for the different types of calculation is shown in Fig. 3c. For PPF, Y_bus is required only once, while the number of PQ profiles equals the size of the sampling. For static N-1 analysis, the number of Y_bus matrices equals the number of contingency cases (number of post-contingency matrices), while PQ is required only once.

Figure 3: CRS sparse matrix for NR-PF and batched data.

$S_{bus} = V_{bus} \cdot (Y_{bus} \cdot V_{bus})^{*}$ (1)

As shown in Fig. 3a and given in Eq. (1), the PF problem is defined as finding the unknown complex voltage V_bus which minimizes the power injection mismatch between the given complex bus injection S_bus and the one calculated with V_bus [36]. Different values are given as input for the different bus types: for the slack bus the complex voltage is given, for a PV bus the voltage magnitude and the active power are fixed, and for a PQ bus the active and reactive power are predefined (shown in Fig. 3a). The complex voltage is represented in polar notation (∠V, |V|) and the complex power injection S_bus is given in cartesian form (P, Q). For the NR algorithm, the following steps are performed iteratively until the maximum allowed number of iterations is reached or the maximum power injection mismatch drops below the tolerance (typically $10^{-8}$ p.u. [9]):
1. Calculate the Nodal Power Mismatch (NPM) ΔP_V, ΔQ_V with V_bus
2. Create/update the JM J with V_bus
3. Solve the linear system
4. Update V_bus with Δ∠V_pv,pq, Δ|V_pq|
In each iteration, the convergence (power injection mismatch) is checked with the NPM calculation. Section 4 explains the details of the linear-system solving step with LU refactorization.
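To make the extended CRS indexing of Fig. 3b concrete, the following minimal C++ sketch (our own structure and names, not the solver's actual types) shows how DiagPtr gives direct access to the diagonal entries of one task's value array without scanning each row:

```cpp
#include <vector>

// CRS indexing shared by all tasks of a batch; only the value arrays
// differ per task, since the sparsity pattern is identical.
struct CRSPattern {
    std::vector<int> RowPtr;   // size: n_rows + 1
    std::vector<int> ColIx;    // size: number of non-zeros
    std::vector<int> DiagPtr;  // size: n_rows, value index of each diagonal
};

// Example: add a value onto every diagonal element of one task's values.
void add_to_diagonal(const CRSPattern& p, std::vector<double>& vals, double d) {
    for (size_t row = 0; row < p.DiagPtr.size(); ++row)
        vals[p.DiagPtr[row]] += d;  // no search over the row needed
}
```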
The utilization of sparse matrices is essential to reduce the required memory and to avoid useless arithmetic operations on zeros, given the high sparsity of Y_bus and J. Since the sparse matrices Y_bus and J share the same sparsity pattern, operations performed on these matrices while iterating over their elements can be broadcast among the tasks within a batch. The storage of the batched profile data in memory is shown in Fig. 3d: the batched profile data is aligned on the same element within the batch (mini-batch) and stored at contiguous memory addresses, instead of within each task vector. On GPU, this property automatically guarantees coalesced memory access and avoids thread divergence within one warp [21], as long as the batch size is a multiple of the default warp size (32 for CUDA). With SMT+SIMD on CPU with mini-batches, it allows full utilization of the FP capability of the instruction set; a mini-batch size of 4 is used for efficient usage of the AVX2 instruction set.
Whether an early saturation of the GPU is limited by computation or by memory bandwidth depends on the property of the kernel; further actions to increase the saturation are required, as explained below. The calculation of the NPM from V_bus is given in Eq. (1). The process can be performed by iterating over the elements of Y_bus once, as shown in Algorithm 1.
Algorithm 1: Batched calculation of the NPM

  row = ThreadBlock.Idy
  batchId = ThreadBlock.Idx
  tIdinBatch = Thread.Idx
  tId = batchId · ThreadBlock.DimX + tIdinBatch
  ΔP_V(row, tId) = −P(row, tId)
  ΔQ_V(row, tId) = −Q(row, tId)
  for dId in RowPtr(row : row+1) do
      col = ColIx(dId)
      |S| = |V(row, tId)| · |Y(dId, tId)| · |V(col, tId)|
      ∠S = ∠V(row, tId) − ∠Y(dId, tId) − ∠V(col, tId)
      if row in bus_pv,pq then
          ΔP_V(row, tId) += |S| · cos(∠S)
      end if
      if row in bus_pq then
          ΔQ_V(row, tId) += |S| · sin(∠S)
      end if
  end for

Due to the task independency between the rows of Y_bus, extra row-level parallelism can be achieved by making one CUDA thread responsible for one row of one task. Algorithm 1 shows the proposed process on GPU; since the NPM calculation is a compute-bound task, the extra row-level parallelization improves the kernel saturation on GPU. On the CPU platform with SMT parallelization, the taskId is given as an input; with SMT+SIMD parallelization, the arithmetic operations are automatically broadcast to all tasks within the mini-batch.
Similar to Algorithm 1, the update of the J matrix is a compute-bound task. Due to the independency between rows, the process can be extended with row-level parallelization for better saturation. In the first step, only the diagonal elements are calculated with the help of DiagPtr, as shown in Fig. 3b. In the second step, the non-diagonal elements are calculated by iterating over Y_bus once, during which the diagonal elements are updated. With the d{P,Q}/d{∠V,|V|} matrix consisting of four sub-matrices available, it can be subset into the J matrix (without P, Q for the slack bus and without Q for PV buses) or converted directly into the permuted form A with a static lookup.

Figure 4: System overview of the parallel PF solver.

Fig. 4 shows the method of integrating the aforementioned parallel PF solver with pandapower. The initialization step is carried out with pandapower in the Python environment, including the initialization of the sparse indexing of Y_bus and J, the static lookups for the sparse matrix updates, and the profile data. With this unified initialization step, the PFs can be solved with the three types of parallel PF solver (SMT, SMT+SIMD, GPU SIMT). The SMT+SIMD version requires an extra transpose-to-mini-batch step, while the GPU version requires the memory transactions between host and device and the transpose-to-batch step on the device. The resulting power injection mismatch is checked on the CPU; if any PF task cannot converge due to numerical instability during LU refactorization, this task is given a second chance with KLU numerical factorization. The branch flows are calculated in parallel after the final convergence check with the resulting V_bus.
The proposed method, performing task-level PFs in parallel under the condition that Y_bus and J keep the same sparsity pattern, profits most when no dependencies between PFs need to be considered. Such use cases include PPF, N-1 analysis, stationary time-series analysis without switch configuration changes, and some training processes in machine learning. For use cases like quasi-static time-series simulation in distribution grids, discrete actions such as the On-Load Tap Changer (OLTC) tap position or the status of shunt compensators rely on status continuity but have no impact on the sparsity pattern.
This type of simulation can still be accelerated by the proposed method, with external control logic updating the profile while preserving the continuity of the discrete variables. However, the control logic might lead to extra loops to correct excessive status changes, which impacts the computational efficiency negatively. For simulations with topology changes, such as switch position or status changes, the extra initialization (costly, as shown in Table 1) needs to be performed for each new topology. When only a limited number of variations of the grid configuration needs to be considered, the proposed method can still accelerate the overall simulation.
The proposed approach is realized in C++/CUDA with an interface to Python through Cython 0.29.21 [37]. The programming language Python has convenient tools for data integration and processing, e.g. Pandas (high level) [38], and efficient numerical operations with Numpy [39]. The underlying data of a Numpy array, stored in C format, can be shared and passed directly to the parallel PF solver.
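This zero-copy hand-over works because a C-contiguous Numpy array is simply a raw buffer of doubles. A minimal sketch of such an entry point (a hypothetical signature for illustration, not the solver's actual API), which Cython can call with the arrays' data pointers:

```cpp
// Hypothetical C entry point callable from Cython. The caller passes
// the raw buffers of C-contiguous Numpy arrays; no data is copied.
extern "C" int solve_pf_batch(
    const int*    row_ptr,     // CRS indexing of Y_bus (shared by the batch)
    const int*    col_ix,
    const double* ybus_vals,   // batched Y_bus values, one set per task
    const double* pq_profile,  // batched P/Q injection profiles
    double*       v_result,    // batched voltage results (output)
    int n_bus, int n_tasks, int max_iter, double tol);
```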
4. Batched Linear System Solver
The solution of the linear system is a time-consuming step in solving NR PF, as it is in the actual pandapower implementation [40]. The mathematical formulation of the linear-system step in the NR process is given in Eq. (2): b is the active and reactive power mismatch for the given V, and the result x is used to update V.

$A_{1..n} \cdot x_{1..n} = b_{1..n}, \quad x = [\Delta \angle V_{pv,pq},\ \Delta |V_{pq}|], \quad b = [\Delta P_{V},\ \Delta Q_{V}]$ (2)

In this work, the existing process of solving the linear system is analyzed and an optimized algorithm for a batched linear system solver is proposed. When solving a linear system with LU factorization, the original system is pre-ordered (and pre-scaled), factorized into a lower-triangular matrix L and an upper-triangular matrix U, and finally solved with Forward Substitution (FS)-Backward Substitution (BS), as in modern implementations [24, 41, 42, 25]. KLU is considered one of the fastest single-threaded linear solvers for JMs [43, 13, 21, 17] and is used as the reference for the proposed method in this work. The KLU solver can easily be extended with task-level parallelization with OpenMP. The following theoretical introduction focuses on the implementation of KLU and gives hints on the optimization for solving many PFs with the same Y_bus sparsity pattern.
In the pre-ordering phase, the original J is permuted in order to reduce the required FPs by reducing the fill-ins. Multiple heuristic methods are available for pre-ordering, whose performance depends on the matrix pattern. In this work, the pre-ordering method AMD [44] available in KLU is used, since [43, 20] reported its generally good fill-in reduction on circuit simulation and PF analysis with the NR algorithm. In other cases, e.g. when factorizing an augmented JM to directly consider control effects such as OLTC, the work [45] presents an extra analysis regarding the pre-ordering method. With the permutation, the original linear system J is permuted into A, as given in Eq. (3), where Perm_col and Perm_row are the corresponding column and row permutation matrices. The numerical factorization is performed on A.

$A_{1..n} = Perm_{col} \cdot J_{1..n} \cdot Perm_{row}, \quad A_{1..n} = L_{1..n} \cdot U_{1..n}$ (3)

In the numerical factorization of KLU, the Gilbert-Peierls left-looking (G-P) algorithm is utilized. Additionally, partial pivoting is performed to improve the numerical stability, by which the permutation matrix Perm_row is updated in order to avoid tiny pivots. This step has an effect on the final sparsity pattern of A. Refactorization is a much faster process, which reduces the computational overhead by presuming the numerical stability of the permutation matrices Perm_row, Perm_col. For NR iterations in solving PF and in circuit simulation, refactorization is preferred [25]; the refactorization mode is also supported by KLU.
The patterns of A, L and U remain unchanged for all tasks when solving many PFs with the same Y_bus sparsity pattern, so the pre-ordering only needs to be performed once at the beginning [16]. Based on this property, a static lookup can be created from the final permutation matrices Perm_row, Perm_col. Since the G-P algorithm requires a Compressed Column Storage (CCS) matrix, which is efficient for iterating over columns, the static lookup directly converts the original CRS JM or d{P,Q}/d{∠V,|V|} matrix into the permuted A in CCS format.
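Before turning to the proposed batched refactorization (Algorithm 2 below), for orientation: the analyze-once, factorize, then refactorize-many workflow described above maps onto the SuiteSparse KLU API roughly as follows (a sketch with error handling omitted; Ap/Ai/Ax are assumed to hold a valid CCS system):

```c
#include "klu.h"

// Factorize once, then refactorize whenever the values in Ax change
// while the sparsity pattern stays the same (as in NR iterations).
void solve_sequence(int n, int* Ap, int* Ai, double* Ax, double* b) {
    klu_common common;
    klu_defaults(&common);
    klu_symbolic* sym = klu_analyze(n, Ap, Ai, &common);      // pre-ordering, once
    klu_numeric*  num = klu_factor(Ap, Ai, Ax, sym, &common); // full factorization

    /* ... update Ax with new Jacobian values of the same pattern ... */
    klu_refactor(Ap, Ai, Ax, sym, num, &common);  // cheap: reuses the pivoting
    klu_solve(sym, num, n, 1, b, &common);        // FS-BS, b overwritten with x

    klu_free_numeric(&num, &common);
    klu_free_symbolic(&sym, &common);
}
```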
Algorithm 2: Sparse G-P refactorization algorithm with column working space

  for tId in 0 : N_task do
      for col in 0 : N_col do
          x = A(:, col, tId)
          for row in URowIx(:, col) do
              x(row+1 :) −= x(row) · L(:, row, tId)
          end for
          U(:, col, tId) = x(: col)
          L(:, col, tId) = x(col+1 :) / x(col)
      end for
  end for

Algorithm 2 gives the SMT LU refactorization algorithm: a sparse G-P [46] refactorization on CPU with a column working space. For SMT, the working space has the size of the dimension of A. With the column working space x, only the non-zero values of A need to be copied for the sparse vector multiply-and-add (VMAD). For SMT+SIMD, a working space with the width of the mini-batch is required; by replacing tId with the ID of the mini-batch, the copy, VMAD and normalization can be extended with SIMD instructions, so that the operations are broadcast within the mini-batch. Since task-level parallelization can already fully saturate all physical CPU cores with SMT and SMT+SIMD, column-level parallelization as proposed in [47] is not needed.
Recent efforts on LU factorization with GPU are presented in [25, 42]; both works focus on accelerating a single LU factorization on GPU, especially for large matrices. For batched LU factorization, [21] presents a scheme with batch- and column-level parallelization. It is essential to saturate the GPU memory bandwidth for LU refactorization and FS-BS in order to achieve good performance on GPU. Due to the underlying data dependencies, not all columns can be factorized simultaneously. A directed acyclic graph (DAG) is used to describe and schedule the columns which can be finished at the same time [48]; the vertices represent the columns and the edges represent the operations between two columns. An example matrix and its DAG are shown in Fig. 6 and Fig. 7, respectively.

Figure 6: Example LU matrix for LU refactorization.
Figure 7: Example scheduling graph for LU refactorization.
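As a concrete single-task realization of Algorithm 2 above, the following C++ sketch (our own names; it assumes CCS patterns of A, L and U with all fill-ins precomputed, row indices sorted ascending, U holding the diagonal and L the strict lower part) performs the refactorization with a dense column working space:

```cpp
#include <algorithm>
#include <vector>

// CCS storage with a fixed pattern; only Val changes between refactorizations.
struct CCS {
    std::vector<int> ColPtr, RowIx;
    std::vector<double> Val;
};

void gp_refactor(const CCS& A, CCS& L, CCS& U, std::vector<double>& x) {
    int n = static_cast<int>(A.ColPtr.size()) - 1;
    for (int col = 0; col < n; ++col) {
        // Scatter the column of A into the dense working space
        // (zeroing the full vector keeps the sketch simple).
        std::fill(x.begin(), x.end(), 0.0);
        for (int p = A.ColPtr[col]; p < A.ColPtr[col + 1]; ++p)
            x[A.RowIx[p]] = A.Val[p];
        // Left-looking update: for each row index in U(:, col) with row < col,
        // subtract that earlier column's contribution (sparse VMAD).
        for (int p = U.ColPtr[col]; p < U.ColPtr[col + 1]; ++p) {
            int row = U.RowIx[p];
            if (row == col) continue;  // diagonal handled below
            double xr = x[row];
            for (int q = L.ColPtr[row]; q < L.ColPtr[row + 1]; ++q)
                x[L.RowIx[q]] -= xr * L.Val[q];
        }
        // Gather results: U(:, col), then L(:, col) normalized by the pivot.
        for (int p = U.ColPtr[col]; p < U.ColPtr[col + 1]; ++p)
            U.Val[p] = x[U.RowIx[p]];
        double pivot = x[col];
        for (int q = L.ColPtr[col]; q < L.ColPtr[col + 1]; ++q)
            L.Val[q] = x[L.RowIx[q]] / pivot;
    }
}
```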
Figure 5: LU parallel levels and FPs (left: IEEE case300, right: case2869pegase).
Figure 8: Task scheduling on the timeline for LU refactorization.

Considering the G-P algorithm with column-level parallelization proposed in [17], and taking the two standard grids shown in Fig. 5 as an example: at the beginning, with a high number of available columns per level, the GPU resources can easily be saturated even with a small batch size. However, a significant share of the FPs is located in the serial region, where only few columns can be factorized at the same time. Taking the NVIDIA GTX 1080 graphics card as an example, the GPU has 20 SMs with 128 SPs each. With a batch size of 512, this equals less than one active warp per SM, which means that 75% of the SPs remain idle when no extra parallelization is applied; consequently, the memory bandwidth cannot be efficiently utilized. Inspired by the work [25], our work improves the batched LU refactorization with fine-grained parallelization for the serial levels, which significantly improves the hardware saturation at small batch sizes. This improvement becomes even more relevant for future applications, with more computational resources and higher memory bandwidth available on new generations of GPUs.
A three-stage batched G-P algorithm is proposed for the GPU refactorization, given in Algorithm 3. LU denotes the working space for the LU refactorization (L + U − I), including all predefined fill-ins; the values of A need to be copied into LU with the fill-ins initialized to 0. In stage 1, with a large number of columns available in each level, the columns are fully factorized in parallel.

Algorithm 3: Multi-stage sparse G-P algorithm on CUDA

  batchId = ThreadBlock.Idx
  tIdinBatch = Thread.Idx
  tId = batchId · ThreadBlock.DimX + tIdinBatch
  col = AvailableColsInLevel(ThreadBlock.Idy)   // stages 1 and 2
  col = LeftoverCols(ThreadBlock.Idy)           // stage 3
  for row in URowIx(:, col) do
      if row not in FinishedCols then
          break
      end if
      if row in FinishedRowsOfCol(col) then
          continue
      end if
      LU(row+1 :, col, tId) −= LU(row, col, tId) · LU(row+1 :, row, tId)
  end for
  if row == URowIx(−1, col) then
      UpdateFinishedColStatus(col)
      LU(row+1 :, col, tId) /= LU(row, col, tId)
  else
      UpdateFinishedRowsOfCol(row, col)
  end if
Each CUDA thread is responsible for one column of one task. In stage 2, besides the columns which can be fully factorized, the viable operations of all unfinished columns are executed simultaneously (shown as green lines in Fig. 7). In stage 3, the extra VMAD parallelization shown in Fig. 6 is applied, with extra CUDA threads scheduled within the same CUDA threadBlock, since only threads within one threadBlock can be synchronized explicitly; this extra VMAD parallelization could use a width of e.g. 4. In stages 2 and 3, some columns of each level can only be partially factorized; the already processed rows of each column are memorized, and the finished columns of each level are directly normalized.
The available columns in each parallel level are known after the symbolic analysis step. As indicated in Fig. 5, the start of stage 2 and stage 3 is solely related to the available columns in each level. This parameter is hardware-related; as manually tuned in our tests, good performance is achieved when stage 2 starts with fewer than 32 columns available in a level and stage 3 starts when only one column is available.
For the example matrix LU shown in Fig. 6, level one in Fig. 7 corresponds to stage 1 of Algorithm 3. After the first level is finished, tasks which belong to columns of later levels can be executed in an earlier stage, which in turn increases the saturation rate and reduces the tasks in the serial levels (see Fig. 8). Level 2 corresponds to stage 2 of the algorithm. In levels 3 and 4, when only one single column is available, the extra VMAD parallelization is applied according to stage 3. In this case, rows 8, 9 and 10 are assigned to Thread.Idy, so the VMAD can be performed more efficiently with a suitable parallelization width instead of element-wise.
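The per-level column counts that control these stage transitions come out of the symbolic analysis. A minimal sketch of that level computation (our own helper, operating on the CCS pattern of U with fill-ins included):

```cpp
#include <algorithm>
#include <vector>

// Compute the parallel level of every column from the U pattern
// (CCS ColPtr/RowIx): a column can be factorized once all columns it
// depends on (its row indices below the diagonal) are finished.
std::vector<int> column_levels(const std::vector<int>& UColPtr,
                               const std::vector<int>& URowIx) {
    int n = static_cast<int>(UColPtr.size()) - 1;
    std::vector<int> level(n, 0);
    for (int col = 0; col < n; ++col)
        for (int p = UColPtr[col]; p < UColPtr[col + 1]; ++p) {
            int row = URowIx[p];
            if (row < col)  // dependency on an earlier column
                level[col] = std::max(level[col], level[row] + 1);
        }
    return level;
}
// Columns of equal level are independent and can be factorized together;
// the per-level column counts decide when stages 2 and 3 take over.
```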
Figure 9: Parallelized FS-BS.
Fig. 9 shows the parallelization strategy of FS-BS with an incomplete L matrix. After applying the permutation matrix to x, FS with the L matrix can be performed on the same working space b/x. A dependency graph can be constructed to guide the parallel execution over the columns of L. As soon as the pivot element (the upper element of x corresponding to the column) is finalized, this column can be executed to update the lower elements in x. When multiple threads try to update the same element, the barrier write function atomicAdd() protects against write collisions. The VMAD parallelization can be applied when few columns are available on the same level. The same approach is applied for U.
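A minimal CUDA sketch of the forward substitution for one dependency level (a simplification of the above: one kernel launch per level instead of in-kernel scheduling; names and the level bookkeeping are our own assumptions):

```cpp
// Forward substitution for all columns of one dependency level, batched
// over tasks. Columns of the same level are independent, but they may
// update the same entries of x, hence atomicAdd (double atomicAdd
// requires compute capability >= 6.0, e.g. the GTX 1080).
__global__ void fs_level(const int* LColPtr, const int* LRowIx,
                         const double* LVal,       // batched strict-L values
                         double* x,                // batched working space b/x
                         const int* cols_in_level, int n_cols,
                         int n_tasks) {
    int c   = blockIdx.y;                             // column within level
    int tId = blockIdx.x * blockDim.x + threadIdx.x;  // task id
    if (c >= n_cols || tId >= n_tasks) return;
    int col = cols_in_level[c];
    double xc = x[col * n_tasks + tId];  // pivot is final within this level
    for (int p = LColPtr[col]; p < LColPtr[col + 1]; ++p) {
        int row = LRowIx[p];             // row > col (strict lower part)
        // Subtract this column's contribution; atomicAdd avoids collisions.
        atomicAdd(&x[row * n_tasks + tId], -LVal[p * n_tasks + tId] * xc);
    }
}
```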
5. Case Studies and Performance Evaluation
The case study on CPU is performed on a Windows 10 PC with an Intel i7-8700 CPU (6 physical cores and Hyper-Threading (HT) technology for 12 virtual cores) and 16 GB DDR4 RAM, with the Intel C++ Compiler V2020 and the Intel Math Kernel Library, which is highly optimized for vectorized computation. The code is compiled with O3 and forced to use the AVX2 instruction set.
The case study on GPU is performed on a PC with an Intel i7-8700k, 32 GB DDR4 RAM and 2x NVIDIA GTX 1080-8GB GPUs, running Ubuntu 18.04.4 LTS. The proposed approach is programmed in C++ and compiled with GCC V7.4 and CUDA V10.1 with O3. The main focus of the benchmarking is the duration of the PF solving and of each subprocess. Because kernel execution in CUDA is initialized from the CPU and executed on the GPU asynchronously, the test is performed with synchronization at the kernel exit (or at the exit of multiple kernels, e.g. for LU refactorization) in order to record the duration of each kernel.
On both test platforms, double-precision floating-point numbers are used, due to the high numerical stability requirement during LU refactorization. The examples are tested with the small-size grid IEEE case300 with 300 buses and the mid-size grid case2869pegase with 2869 buses, both available in pandapower. Both contain meshed Extra-High-Voltage (EHV) and High-Voltage (HV) levels; IEEE case300 also contains MV and Low-Voltage (LV) buses.
This section gives a performance evaluation of the aforementioned batched linear solver on CPU and GPU and its performance relative to KLU.
Fig. 10 shows the performance on the CPU platform. On both test grids, a performance improvement of the implemented G-P algorithm can be observed over both the numerical factorization and the refactorization modes of KLU, because it works directly on the already permuted A matrix, while KLU takes the original matrix as input. With SMT parallelization, a performance improvement of more than 6x can be observed for all tested methods when increasing the number of threads, especially below the number of physical cores (6). When further increasing the thread number, the HT technology helps to further improve the saturation of the physical cores and thus the overall performance. With SMT+SIMD parallelization, further speed-ups can be observed. However, a slight performance decrease can be observed on the mid-size grid with the G-P SMT+SIMD version when increasing the thread number from 6 to 12; this behavior is caused by a reduced cache hit rate due to the larger problem size. Overall, with the proposed method we achieved a performance improvement of 20x-70x on the CPU platform. The acceleration rate is closely related to the size of the problem (number of buses).
On the GPU platform, Fig. 11 shows the benchmarking results with different batch sizes; the best CPU result is used as the baseline. It can be observed that our approach of fine-grained parallelization of the refactorization process leads to an earlier saturation of the GPU resources and achieves much better performance when the batch size is small, for both grids. The extra VMAD parallelization with 4x threads in stage 3 improves the performance when the batch size is very small.
For the convenience of comparing the performance of the functions involved in NR PF, these can be categorized as follows:
• FP-dominant (Float): calculate NPM, update d{P,Q}/d{∠V,|V|}
• Memory and FP (LU Total): LU refactorization, FS-BS
• Memory-dominant (Memory): permute JM (A), update V
The performance on CPU is evaluated with timing and the Intel VTune profiling tool for memory access pattern analysis. Fig. 12 shows the time of each process under SMT parallelization and SMT+SIMD parallelization with different numbers of threads. For both the small and the mid-size grid, similar to the observations for the batched linear solver, the performance increases almost linearly at the beginning and only slightly beyond 6 threads. For the mid-size grid, the last-level cache miss rate of SMT+SIMD with 12 threads increases from 0.1% overall to 5% for "LU Total" and 25% for "Memory", compared to only 4 threads. The random memory access patterns of "LU Total" and "Memory" require frequent exchanges between the cache and the system DRAM, whose limited bandwidth drags the overall performance back as the thread number increases. Still, the SMT+SIMD version already outperforms the 12-threaded SMT parallelization scheme at a thread number of 5. For the small grid, with the data fitting well into the cache, an increased number of threads improves the overall performance.
Concluding, on the CPU platform, increasing the number of threads only brings benefit when solving PFs of a small-scale grid.
With SMT+SIMD, in contrast, the cache should be considered and the number of threads should be carefully chosen given the grid size and the available hardware; on the test CPU, for instance, the number of physical cores (in this case 6) is a good first trial.

Figure 10: Batched linear solver benchmark on CPU (left: IEEE case300, right: case2869pegase).
Figure 11: Batched linear solver benchmark on GPU (left: IEEE case300, right: case2869pegase).
Figure 12: Function benchmark results on CPU for the performance comparison of SMT and SMT+SIMD (left: IEEE case300, right: case2869pegase).
Figure 13: Function benchmark results on GPU (left: IEEE case300, right: case2869pegase).
Table 1: Benchmark results of the profile simulation including initialization.

case name            case118  case300  case1354  case2869   case9241  sb mv-lv  sb hv-mv  sb ehv-hv
number of buses          118      300      1354      2869       9241       115      1787        713
number of branches       186      411      1991      4582      16019       115      1836       1275
Timing of the initialization (ms)
                          24       67       267       608       2211        36       353        142
Timing of 10,000 PFs (ms)
pandapower            18,956   47,373   149,234   464,009  1,794,955    12,366   126,062    101,279
SMT best                 175      778     2,666     8,323     40,835       107     2,291      1,558
SMT+SIMD best             97      333     1,524     7,217     38,848        52     1,646        721
1 gpu best                50      201       639     1,941     10,658        43       656        449
2 gpu best                26      102       342     1,021      6,525        24       353        233

The GPU version requires the extra memory transactions between host and device, which are labelled "DH Memory". Fig. 13 shows the function performance on GPU with one stream for both grids with 10,000 calculations. It can be observed that, except for "LU Total", the tasks saturate the GPU resources at small batch sizes and are thus insensitive to changes of the batch size for both grids. Compared to the best CPU performance, the "FP" functions achieve an average improvement of around 5x-6x.
Fig. 14 shows the improvement from CUDA concurrency with streams and multiple GPUs at different batch sizes. The test case with batch size 512 and one stream is used as the baseline scenario. With multiple streams on one GPU, an improvement due to hiding the memory transactions between host and device can be observed for all batch sizes. When the batch size is small, the improvement is higher, which is related to the potential for kernel overlapping. When two GPUs are available, an acceleration of around factor 2x can be observed. However, the improvement of the overall execution time varies due to the imbalanced distribution of the computational load among GPUs and streams, with specific streams/GPUs undertaking
more tasks than the others. Taking batch size 2048 as an example: when executing 10,000 PFs, 4 x 2048 tasks are distributed equally, while one GPU/stream has to execute the 1,808 leftover tasks.

Figure 14: Duration and speedups with batch sizes and numbers of streams.
Figure 15: Acceleration factor of the proposed approach compared to the baseline case pandapower.
The benchmarking results of running 10,000 PFs on multiple grids with the same Y_bus are listed in Table 1. The parallel PF solver is integrated in pandapower. Besides case2869pegase and IEEE case300, the further standard grids IEEE case118, case1354pegase and case9241pegase available in pandapower are utilized to evaluate the performance for different grid dimensions. In addition, three grids from the SimBench open-source dataset [49], which was created to represent the characteristics of real German grids and contains realistic time series data for loads and DERs, are used for the benchmark. The dataset contains yearly time series at 15-minute resolution (35,040 time steps in total), of which the first 10,000 time steps are used. With multiple performance optimizations, especially in the creation of the JM, pandapower already gives better performance compared to matpower [10, 40]; thus the baseline case is performed with the PF function of pandapower named "newtonpf", which is not parallelized and uses SuperLU as the linear system solver.
Fig. 15 shows the acceleration factor with regard to the number of buses in the grid. The acceleration factor is calculated as T_pandapower / T_case. This comparison against the baseline is fair for evaluating the gained acceleration, since the required NR iterations and the grid topology are the same in pandapower and in the proposed approach. As can also be seen in Table 1, with increasing grid dimension the acceleration on the GPU is more significant than on the CPU, due to the high FP capability and memory bandwidth of the GPU. On large grids, e.g. case9241pegase, the GPU version is 4x faster than the best CPU performance. Note that for case9241pegase, due to the large GPU memory requirements, batch sizes larger than 2048 failed in the benchmarking. The acceleration of SMT+SIMD over SMT also decreases, due to the aforementioned cache issue. On small grids, because of the good usage of the CPU cache, the CPU performance is satisfying.
From Table 1 it can further be observed that the grid dimension has a direct impact on the one-time initialization time, which, according to Fig. 4, applies to both the CPU and the GPU cases. The one-time initialization takes a significant amount of time, since most of this code is executed in the Python environment; it could be further optimized, e.g. with just-in-time (JIT) compilation or C/C++ integration.
In the tests varying the number of PF calculations from 100 to 10,000, the number of calculations has little impact on the average time of solving a single PF on CPU with SMT and SMT+SIMD. For the GPU version, due to the aforementioned saturation issue, the performance improves with an increasing number of calculations; after the saturation point of around 2,000 calculations is reached, the average time of solving one PF is almost constant.
6. Conclusion
In this paper, an approach for solving many power flows in parallel on CPU/GPU is proposed. When the grid admittance matrices share the same sparsity pattern, the Newton-Raphson power flows can be efficiently solved in parallel. The performance is evaluated in detail with multiple test cases covering small, mid-size and large grids.
An impressive acceleration (more than 100x over the open-source tool pandapower) was achieved with CPU/GPU parallelization. The performance of the fast parallel power flow solver originates mainly from the following points:
• Avoidance of repetitive work (sparse matrix indexing initialization, pre-ordering, lookup creation, etc.),
• Reduction of computational overhead with LU refactorization,
• Hardware-specific optimization and parallelization strategies.
In detail, on the CPU platform, the power flows can be accelerated with little effort with SMT parallelization. With SMT+SIMD parallelization, the acceleration effect depends strongly on the problem size; on small grids, a further speed-up of 2x-3x compared to SMT parallelization can be expected. On the GPU platform, with the batched operations on sparse matrices, the high FP capability and high memory bandwidth can be effectively saturated. Compared to the CPU counterparts, the compute-bound functions are accelerated significantly, while the gain for the memory-bound functions depends strongly on the problem size.
The outstanding performance of the proposed parallel power flow solver shows promising applications in real-time grid operation, allowing the consideration of uncertainties. Innovative research on data-driven machine-learning methods in power systems can benefit greatly from it. Even more potential can be exploited by applying the proposed solver on high-performance computing clusters with multiple CPUs and GPUs.
Acknowledgements
The authors would like to thank Florian Schäfer and Dr. Alexander Scheidler for their suggestions to improve the quality of this paper. The work was supported by the European Union's Horizon 2020 research and innovation programme within the project EU-SysFlex under grant agreement No 773505.
References

[1] H. Yu, C. Y. Chung, K. P. Wong, H. W. Lee, J. H. Zhang, Probabilistic load flow evaluation with hybrid latin hypercube sampling and cholesky decomposition, IEEE Transactions on Power Systems 24 (2) (2009) 661–667. doi:10.1109/TPWRS.2009.2016589.
[2] C.-L. Su, Probabilistic load-flow computation using point estimate method, IEEE Transactions on Power Systems 20 (4) (2005) 1843–1851. doi:10.1109/TPWRS.2005.857921.
[3] J. Usaola, Probabilistic load flow with correlated wind power injections, Electric Power Systems Research 80 (5) (2010) 528–536. doi:10.1016/j.epsr.2009.10.023.
[4] F. Schäfer, J.-H. Menke, M. Braun, Contingency analysis of power systems with artificial neural networks, in: 2018 IEEE International Conference on Communications, Control, and Computing Technologies for Smart Grids (SmartGridComm), IEEE, 2018, pp. 1–6. doi:10.1109/SmartGridComm.2018.8587482.
[5] K. R. Mestav, J. Luengo-Rozas, L. Tong, Bayesian state estimation for unobservable distribution systems via deep learning, IEEE Transactions on Power Systems 34 (6) (2019) 4910–4920. doi:10.1109/TPWRS.2019.2919157.
[6] Y. Liu, N. Zhang, Y. Wang, J. Yang, C. Kang, Data-driven power flow linearization: A regression approach, IEEE Transactions on Smart Grid 10 (3) (2019) 2569–2580. doi:10.1109/TSG.2018.2805169.
[7] J.-H. Menke, N. Bornhorst, M. Braun, Distribution system monitoring for smart power grids with distributed generation using artificial neural networks, International Journal of Electrical Power & Energy Systems 113 (2019) 472–480. doi:10.1016/j.ijepes.2019.05.057.
[8] J.-H. Menke, Z. Wang, F. Schäfer, A. Scheidler, M. Braun, Application of artificial neural networks for approximating real-time economical dispatch with quasi differentiable programming.
[9] R. D. Zimmerman, C. E. Murillo-Sanchez, R. J. Thomas, Matpower: Steady-state operations, planning, and analysis tools for power systems research and education, IEEE Transactions on Power Systems 26 (1) (2011) 12–19. doi:10.1109/TPWRS.2010.2051168.
[10] L. Thurner, A. Scheidler, F. Schäfer, J.-H. Menke, J. Dollichon, F. Meier, S. Meinecke, M. Braun, Pandapower—an open-source python tool for convenient modeling, analysis, and optimization of electric power systems, IEEE Transactions on Power Systems 33 (6) (2018) 6510–6521. doi:10.1109/TPWRS.2018.2829021.
[11] I. Araújo, V. Tadaiesky, D. Cardoso, Y. Fukuyama, Á. Santana, Simultaneous parallel power flow calculations using hybrid cpu-gpu approach, International Journal of Electrical Power & Energy Systems 105 (2019) 229–236. doi:10.1016/j.ijepes.2018.08.033.
[12] C. Guo, B. Jiang, H. Yuan, Z. Yang, L. Wang, S. Ren, Performance comparisons of parallel power flow solvers on gpu system, in: 2012 IEEE International Conference on Embedded and Real-Time Computing Systems and Applications, IEEE, 2012, pp. 232–239. doi:10.1109/RTCSA.2012.36.
[13] X. Su, C. He, T. Liu, L. Wu, Full parallel power flow solution: A gpu-cpu-based vectorization parallelization and sparse techniques for newton–raphson implementation, IEEE Transactions on Smart Grid 11 (3) (2020) 1833–1844. doi:10.1109/TSG.2019.2943746.
[14] X. Li, F. Li, H. Yuan, H. Cui, Q. Hu, Gpu-based fast decoupled power flow with preconditioned iterative solver and inexact newton method, IEEE Transactions on Power Systems 32 (4) (2017) 2695–2703. doi:10.1109/TPWRS.2016.2618889.
[15] V. Roberge, M. Tarbouchi, F. Okou, Parallel power flow on graphics processing units for concurrent evaluation of many networks, IEEE Transactions on Smart Grid 8 (4) (2017) 1639–1648. doi:10.1109/TSG.2015.2496298.
[16] G. Zhou, Y. Feng, R. Bo, L. Chien, X. Zhang, Y. Lang, Y. Jia, Z. Chen, Gpu-accelerated batch-acpf solution for n-1 static security analysis, IEEE Transactions on Smart Grid 8 (3) (2017) 1406–1416. doi:10.1109/TSG.2016.2600587.
[17] G. Zhou, R. Bo, L. Chien, X. Zhang, S. Yang, D. Su, Gpu-accelerated algorithm for online probabilistic power flow, IEEE Transactions on Power Systems 33 (1) (2018) 1132–1135. doi:10.1109/TPWRS.2017.2756339.
[18] M. Abdelaziz, Gpu-opencl accelerated probabilistic power flow analysis using monte-carlo simulation, Electric Power Systems Research 147 (2017) 70–72. doi:10.1016/j.epsr.2017.02.022.
[19] M. M. A. Abdelaziz, Opencl-accelerated probabilistic power flow for active distribution networks, IEEE Transactions on Sustainable Energy 9 (3) (2018) 1255–1264. doi:10.1109/TSTE.2017.2781148.
[20] S. Huang, V. Dinavahi, Real-time contingency analysis on massively parallel architectures with compensation method, IEEE Access 6 (2018) 44519–44530. doi:10.1109/ACCESS.2018.2864757.
[21] G. Zhou, R. Bo, L. Chien, X. Zhang, F. Shi, C. Xu, Y. Feng, Gpu-based batch lu-factorization solver for concurrent analysis of massive power flows, IEEE Transactions on Power Systems 32 (6) (2017) 4975–4977. doi:10.1109/TPWRS.2017.2662322.
[22] S. Huang, V. Dinavahi, Fast batched solution for real-time optimal power flow with penetration of renewable energy, IEEE Access 6 (2018) 13898–13910. doi:10.1109/ACCESS.2018.2812084.
[23] A. Fog, The microarchitecture of intel, amd and via cpus: An optimization guide for assembly programmers and compiler makers.
[24] T. A. Davis, E. P. Natarajan, Algorithm 907: Klu, a direct sparse solver for circuit simulation problems, ACM Transactions on Mathematical Software 37 (3) (2010) 1–17. doi:10.1145/1824801.1824814.
[25] X. Chen, L. Ren, Y. Wang, H. Yang, Gpu-accelerated sparse lu factorization for circuit simulation with performance modeling, IEEE Transactions on Parallel and Distributed Systems 26 (3) (2015) 786–795. doi:10.1109/TPDS.2014.2312199.
[26] P. Gepner, Using avx2 instruction set to increase performance of high performance computing code, Computing and Informatics 36 (5) (2017) 1001–1018.
[27] E. Khartchenko, Vectorization: Performance with quantifi.
[28] J. D. Owens, M. Houston, D. Luebke, S. Green, J. E. Stone, J. C. Phillips, Gpu computing, Proceedings of the IEEE 96 (5) (2008) 879–899. doi:10.1109/JPROC.2008.917757.
[29] NVIDIA, Cuda c++ programming guide.
[30] L. Dagum, R. Menon, Openmp: an industry standard api for shared-memory programming, IEEE Computational Science and Engineering 5 (1) (1998) 46–55. doi:10.1109/99.660313.
[31] D. M. Tullsen, S. J. Eggers, H. M. Levy, Simultaneous multithreading: Maximizing on-chip parallelism, in: Proceedings of the 22nd Annual International Symposium on Computer Architecture, 1995, pp. 392–403.
[32] OpenMP Architecture Review Board, Openmp application program interface: Version 4.0.
[33] Y. Zhang, J. D. Owens, A quantitative performance analysis model for gpu architectures, in: 2011 IEEE 17th International Symposium on High Performance Computer Architecture, IEEE, 2011, pp. 382–393. doi:10.1109/HPCA.2011.5749745.
[34] V. Volkov, Understanding latency hiding on gpus, Ph.D. thesis, EECS Department, University of California, Berkeley (2016).
[35] NVIDIA, Cuda c++ best practices guide. URL https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html
[36] J. J. Grainger, W. D. Stevenson, Power system analysis, McGraw-Hill, New York and London, 1994.
[37] S. Behnel, R. Bradshaw, C. Citro, L. Dalcin, D. S. Seljebotn, K. Smith, Cython: The best of both worlds, Computing in Science & Engineering 13 (2) (2011) 31–39. doi:10.1109/MCSE.2010.118.
[38] W. McKinney, Data structures for statistical computing in python, in: Proceedings of the 9th Python in Science Conference, SciPy, 2010, pp. 56–61. doi:10.25080/Majora-92bf1922-00a.
[39] C. R. Harris, K. J. Millman, S. J. van der Walt, R. Gommers, P. Virtanen, D. Cournapeau, E. Wieser, J. Taylor, S. Berg, N. J. Smith, R. Kern, M. Picus, S. Hoyer, M. H. van Kerkwijk, M. Brett, A. Haldane, J. F. Del Río, M. Wiebe, P. Peterson, P. Gérard-Marchant, K. Sheppard, T. Reddy, W. Weckesser, H. Abbasi, C. Gohlke, T. E. Oliphant, Array programming with numpy, Nature 585 (7825) (2020) 357–362. doi:10.1038/s41586-020-2649-2.
[40] F. Schäfer, M. Braun, An efficient open-source implementation to compute the jacobian matrix for the newton-raphson power flow algorithm. doi:10.1109/ISGTEurope.2018.8571471.
[41] X. S. Li, An overview of superlu, ACM Transactions on Mathematical Software 31 (3) (2005) 302–325. doi:10.1145/1089014.1089017.
[42] S. Peng, S. X. D. Tan, Glu3.0: Fast gpu-based parallel sparse lu factorization for circuit simulation. URL http://arxiv.org/pdf/1908.00204v3
[43] L. Razik, L. Schumacher, A. Monti, A. Guironnet, G. Bureau, A comparative analysis of lu decomposition methods for power system simulations, 1–6. doi:10.1109/PTC.2019.8810616.
[44] P. R. Amestoy, Enseeiht-Irit, T. A. Davis, I. S. Duff, Algorithm 837: Amd, an approximate minimum degree ordering algorithm, ACM Transactions on Mathematical Software 30 (3) (2004) 381–388. doi:10.1145/1024074.1024081.
[45] I. Kocar, J. Mahseredjian, U. Karaagac, G. Soykan, O. Saad, Multiphase load-flow solution for large-scale distribution systems using mana, IEEE Transactions on Power Delivery 29 (2) (2014) 908–915. doi:10.1109/TPWRD.2013.2279218.
[46] J. R. Gilbert, T. Peierls, Sparse partial pivoting in time proportional to arithmetic operations: TR 86-783.
[47] X. Chen, Y. Wang, H. Yang, Nicslu: An adaptive sparse matrix solver for parallel circuit simulation, IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems 32 (2) (2013) 261–274. doi:10.1109/TCAD.2012.2217964.
[48] T. Nechma, M. Zwolinski, Parallel sparse matrix solution for circuit simulation on fpgas, IEEE Transactions on Computers 64 (4) (2015) 1090–1103. doi:10.1109/TC.2014.2308202.
[49] S. Meinecke, D. Sarajlić, S. R. Drauz, A. Klettke, L.-P. Lauven, C. Rehtanz, A. Moser, M. Braun, Simbench—a benchmark dataset of electric power systems to compare innovative solutions based on power flow analysis, Energies 13 (12) (2020) 3290. doi:10.3390/en13123290.