Load balancing for distributed nonlocal models within asynchronous many-task systems
A Preprint
Pranav Gadikar
Computer Science Department, Indian Institute of Technology Madras, Chennai, India
[email protected]

Patrick Diehl
Center for Computation and Technology, Louisiana State University, Baton Rouge, USA
[email protected]

Prashant K. Jha
Oden Institute for Computational Engineering and Sciences, The University of Texas at Austin, Austin, USA
[email protected]

February 9, 2021
Abstract
In this work, we consider the challenges of developing a distributed solver for models based on nonlocal interactions. In nonlocal models, in contrast to local models such as the wave and heat partial differential equations, the material interacts with neighboring points on a larger length scale than the mesh discretization. In developing a fully distributed solver, the interaction over a length scale greater than the mesh size introduces additional data dependencies among the compute nodes and a communication bottleneck. In this work, we carefully look at these challenges in the context of nonlocal models; to keep the presentation focused on the computational issues, we consider a nonlocal heat equation in a 2d setting. In particular, the distributed framework we propose pays particular attention to the bottleneck of data communication and to the dynamic balancing of loads among nodes with varying compute capacity. For load balancing, we propose a novel framework that assesses the compute capacity of nodes and dynamically balances the load so that the idle time among nodes is minimal. Our framework relies heavily on the HPX library, an asynchronous many-task runtime system. We present several results demonstrating the effectiveness of the proposed framework.

Keywords: Nonlocal Computational Models · Load Balancing · AMT · HPX · Parallel and Distributed Computing

Introduction

Nonlocal models appear in various fields: peridynamics for modeling fracture and damage in solid media [1, 2], the nonlocal heat and diffusion equation [3], cell-cell adhesion in computational biology [4, 5], and, recently, the application of peridynamics to granular media [6]. Unlike models based on local interaction, for instance the wave and heat partial differential equations, nonlocal models include the interaction of material points over a finite length scale; as a result, after the spatial discretization of the model, the discrete points interact over a length scale that is larger than the mesh size (the minimum spacing between the discrete points). In contrast, in the discretization of the wave or heat equation, the interaction is local, i.e., a discrete point only interacts with its nearest neighbor points. The larger interaction length scale introduces a major challenge in implementing a distributed solver: a node has a stronger data dependency on its neighboring nodes, which means that the message size needed to exchange the ghost zones increases. Thus, we need to overlap communication with computation to address this challenge. We carefully look at this issue in the context of the nonlocal heat equation, which is a simple nonlocal model but still general enough that the framework developed in this work can be applied to more complex models such as peridynamics [1, 2]. We investigate the following: (i) Minimizing data exchange – we use the METIS library [7] to generate optimal partitions with minimum data exchange between nodes. (ii) Hiding data exchange time – the partitions are divided such that we have independent sub-partitions and sub-partitions depending on ghost zones on a different node. Computations on the independent partitions (partitions which do not depend on the data of other compute nodes) are performed asynchronously while waiting for the data from the neighboring nodes needed for computations on the dependent partitions. (iii) Load balancing – the partitions are redistributed among the nodes to minimize the waiting time for the faster nodes.
The shape of the partitioning obtained by the METIS library is preserved to the maximum possible extent, in order to reduce the data dependencies. For the asynchronous function computation and synchronization, we rely heavily on the HPX library [8, 9]; HPX is the C++ standard library for parallelism and concurrency, providing high-level programming abstractions, wait-free asynchronous execution, and futurization for synchronization. One key feature we utilize is that HPX resolves the data dependencies and generates an asynchronous execution graph, which allows us to implicitly overlap communication and computation.

The paper is structured as follows: in Section 2 we briefly introduce the nonlocal heat equation, and in Section 3 we highlight the key challenges in the development of a distributed solver. We discuss the key features of HPX used in our implementation in Section 5 and then present our core technical contributions in Section 6. We propose a novel load balancing algorithm in Section 7. We demonstrate the scaling results of our distributed solver and load balancing algorithm in Section 8 and conclude in Section 9.
The nonlocal heat equation

In this section, we briefly introduce the nonlocal heat equation. We chose this model due to its simplicity; however, the algorithms and ideas discussed in this work should carry over to more complex nonlocal fracture models such as peridynamics. We consider a two-dimensional nonlocal diffusion equation for the temperature $u : [0,T] \times D \to \mathbb{R}$ over the square domain $D = [0,1]^2$ for the time interval $[0,T]$. We impose a zero temperature boundary condition on the boundary of $D$ and specify a heat source $b : [0,T] \times D \to \mathbb{R}$ as a function of points in the domain and time. Let $\epsilon > 0$ denote the length scale of the nonlocal interaction. The temperature satisfies

$$\frac{\partial u(t,x)}{\partial t} = b(t,x) + c \int_{B_\epsilon(x)} J\left(\frac{|y-x|}{\epsilon}\right) \left(u(t,y) - u(t,x)\right) \, dy, \qquad (1)$$

where $|y-x|$ denotes the Euclidean norm of the vector $y-x$ in 2d, $B_\epsilon(x) = \{y : |y-x| \le \epsilon\}$ is the ball of radius $\epsilon$ centered at $x$, and $J = J(r)$ is the influence function. We consider $J = 1$ for simplicity. Here, the constant $c$ is related to the heat conductivity $k$ of the material as follows:

$$c = \begin{cases} \dfrac{k}{\epsilon^{3} M_{2}}, & \text{when dimension } d = 1, \\[4pt] \dfrac{2k}{\pi \epsilon^{4} M_{3}}, & \text{when dimension } d = 2, \end{cases} \qquad (2)$$

where $M_i = \int_0^1 J(r) r^i \, dr$ is the $i$-th moment of the influence function. Equation (2) can be derived by substituting the Taylor series expansion, $u(t,y) = u(t,x) + \nabla u(t,x) \cdot (y-x) + \tfrac{1}{2} \nabla^2 u(t,x) : (y-x) \otimes (y-x) + O(\epsilon^3)$, in (1) and comparing the Laplacian term with that from the classical heat equation. We refer to [3] for more discussion of the nonlocal diffusion equation.

Figure 1: Material domain $D$ with the nonlocal boundary $D^c$ surrounding $D$. The figure shows a typical material point $x \in D$ and its neighborhood region, a ball of radius $\epsilon$ centered at $x$, $B_\epsilon(x)$. We also show the spatial discretization of the domains $D$ and $D^c$ through a uniform grid.

The initial condition is given by

$$u(0,x) = u_0(x) \quad \forall x \in D. \qquad (3)$$

The boundary condition is typically prescribed over a region of finite area (in 2d) or volume (in 3d). Let $D^c = (-\epsilon, 1+\epsilon)^2 - D$ be the annular square region obtained by removing $D$ from the bigger square $(-\epsilon, 1+\epsilon)^2$; see Figure 1. We apply the zero temperature boundary condition on $D^c$, i.e.,

$$u(t,x) = 0 \quad \forall x \in D^c \text{ and } \forall t \in [0,T]. \qquad (4)$$

To solve for the temperature $u$ for a given source $b$, we consider a finite-difference-in-space and forward-Euler-in-time discretization of (1). We discuss the discretization next.
We discretize the domain using a uniform grid with grid size $h > 0$; see Figure 1. Let $D_h$ and $D^c_h$ denote the sets of spatial coordinates of grid points after discretization of $D$ and $D^c$. We consider an index set $K \subset \mathbb{Z}^2$ such that for $i \in K$, $x_i = hi \in D_h$. Similarly, we consider an index set $K^c \subset \mathbb{Z}^2$ such that for $i \in K^c$, $x_i = hi \in D^c_h$. Let $\{t_0 = 0, t_1 = \Delta t, t_2 = 2\Delta t, \ldots, t_N = N\Delta t\}$, with $t_N \le T$, be the discretization of the time interval $[0,T]$. Here $\Delta t$ is the size of the timestep.

Let $\hat{u}^k_i$ denote the temperature at grid point $i \in K \cup K^c$ at time $t_k = k\Delta t$. Using the finite-difference approximation and the forward-Euler time-stepping scheme, we write the discrete system of equations for $\hat{u}^k_i$, for all $i \in K$ and $1 \le k \le N$:

$$\frac{\hat{u}^{k+1}_i - \hat{u}^k_i}{\Delta t} = b(t_k, x_i) + c \sum_{j \in K \cup K^c, \, |x_j - x_i| \le \epsilon} J(|x_j - x_i|/\epsilon)\,(\hat{u}^k_j - \hat{u}^k_i)\, V_j. \qquad (5)$$

The boundary condition over $D^c$ translates to

$$\hat{u}^k_i = 0, \quad \forall i \in K^c, \; 0 \le k \le N.$$

To apply the initial condition, we set $\hat{u}^0_i = u_0(x_i)$ for all $i \in K$. In the above, $V_j$ is the area occupied by the grid point $j$ in the mesh; for the uniform discretization with grid size $h$, we have $V_j = h^2$.

To test the implementation of the discretization scheme in Subsection 2.1, we consider a method of constructing the exact solution. Let

$$w(t,x) = \cos(2\pi t) \sin(2\pi x_1) \sin(2\pi x_2)$$

when $x \in D$ and $w(t,x) = 0$ when $x \notin D$. We consider an external heat source of the form

$$b(t,x) = \frac{\partial w(t,x)}{\partial t} - c \int_{B_\epsilon(x)} J(|y-x|/\epsilon)\,(w(t,y) - w(t,x)) \, dy. \qquad (6)$$

With $b$ given as above and the initial condition $u_0(x) = w(0,x) = \sin(2\pi x_1)\sin(2\pi x_2)$, we can show that $u(t,x) = w(t,x)$ is the exact solution of (1). Next, we define the numerical error. Suppose $\bar{u}(t,x)$ is the exact solution and $\hat{u}^k_i$, for $0 \le k \le N$ and $i \in K$, is the numerical solution. The error at time $t_k$ is taken as

$$e_k = h^d \sum_{i \in K} |\bar{u}(t_k, x_i) - \hat{u}^k_i|, \qquad (7)$$

where $d \in \{1, 2\}$ is the spatial dimension ($d = 2$ in our setting). The total error is the sum of the errors at all timesteps, i.e., $e = \sum_{0 \le k \le N} e_k$.
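To make the scheme concrete, the following is a minimal serial sketch of the update (5), assuming J = 1, V_j = h^2, and a zero-temperature ghost layer enforcing (4); the parameter values (n, r, dt, number of timesteps) and all names are illustrative, not taken from the paper's code.

    #include <cmath>
    #include <cstdio>
    #include <vector>

    int main() {
        const int n = 64;                 // grid points per direction in D
        const double h = 1.0 / n;         // grid size
        const int r = 4;                  // horizon in grid points, eps = r * h
        const double eps = r * h;
        const double pi = std::acos(-1.0);
        // Constant c per (2) with k = 1 and J = 1 (so M_3 = 1/4).
        const double c = 2.0 / (pi * std::pow(eps, 4) * 0.25);
        const double dt = 1e-5;           // timestep size (illustrative)
        const int N = n + 2 * r;          // grid including the ghost layer D^c
        std::vector<double> u(N * N, 0.0), unew(N * N, 0.0);
        auto id = [N](int i, int j) { return i * N + j; };
        // Initial condition (3): u_0(x) = sin(2 pi x1) sin(2 pi x2) on D.
        for (int i = r; i < r + n; ++i)
            for (int j = r; j < r + n; ++j)
                u[id(i, j)] = std::sin(2 * pi * (i - r) * h) * std::sin(2 * pi * (j - r) * h);
        for (int k = 0; k < 20; ++k) {    // forward-Euler time stepping
            for (int i = r; i < r + n; ++i)
                for (int j = r; j < r + n; ++j) {
                    double sum = 0.0;
                    // Nonlocal neighborhood: all DPs with |x_j - x_i| <= eps.
                    for (int di = -r; di <= r; ++di)
                        for (int dj = -r; dj <= r; ++dj)
                            if (di * di + dj * dj <= r * r)
                                sum += u[id(i + di, j + dj)] - u[id(i, j)];
                    // Update (5) with b = 0 and V_j = h^2; ghosts stay 0 per (4).
                    unew[id(i, j)] = u[id(i, j)] + dt * c * sum * h * h;
                }
            std::swap(u, unew);
        }
        std::printf("u at center after 20 steps: %f\n", u[id(r + n / 2, r + n / 2)]);
    }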
Key challenges

Fully parallelizing a serial implementation of the nonlocal equation into a distributed version deployed on an asynchronous many-task system (AMT) involves numerous challenges. These challenges are primarily related to distributing the work across multiple computational nodes, minimizing the idle time spent on data exchange among computational nodes, and ensuring maximum efficiency across all computational nodes. Designing and implementing a distributed solver that addresses and overcomes the challenges listed below is critical to ensure optimal performance:

1. Mesh partitioning – breaking down the main problem into smaller sub-problems that can be distributed across multiple computational nodes. Each computational node may contain multiple such sub-problems. This simple idea of unitized work greatly helps to simplify the work distribution and load balancing in AMTs, where there are multiple computational nodes within a cluster and each computational node consists of multiple CPUs. In such a scenario, each of the sub-problems can easily be assigned to different threads within a single computational node to utilize the multiple CPUs on that node.
2. Minimizing data exchange – distributing the sub-problems among multiple computational nodes so as to achieve minimum data dependencies. The efficiency of a distributed solver is limited by the data dependencies across the computational nodes. Minimal data dependency across the computational nodes ensures minimal data exchange time between computational nodes and better scaling.
3. Hiding data exchange time – doing useful computation while waiting for the data from the neighboring computational nodes. Despite an optimal work distribution, the computational nodes need to exchange the data required for the computation of the nonlocal solver. To avoid keeping the computational nodes idle while the data is being exchanged, it is possible to perform computations on the portions of the owned domain that do not depend on the data of other nodes. This asynchronous style of computation helps us hide the data exchange time and ensures optimal performance of the distributed solver.
4. Load balancing – redistributing the sub-problems across multiple computational nodes according to their load, to minimize the waiting time of the faster nodes. The compute capacity of the individual computational nodes may vary with time, either due to the scheduling of some other task or due to the intrinsic behaviour of the nonlocal model. In such a scenario, the updated compute capacity of the individual nodes needs to be accounted for, and the load needs to be rebalanced so that the faster nodes do not sit idle.

We describe our solutions to all of the challenges above in Section 6 and propose a novel load balancing algorithm in Section 7.
Formalism
We define the following terms used throughout this and the following sections:
• Distributed solver – the time-stepping scheme (5), where we start with the given initial temperature at the discrete points, $\hat{u}^0$, in the domain and apply (5) to compute the temperature at the successive times $t_1 = \Delta t, t_2 = 2\Delta t, \ldots, t_k = k\Delta t$. At each timestep $k$, the right-hand side of (5) must be computed first to obtain the temperature at the next timestep $t_{k+1}$. This involves an underlying challenge of the data dependency for all the points within an $\epsilon$ distance of a given point $x_i$. Since $\epsilon$ is typically larger than $2h$, where $h$ is the grid size, each computational node needs to scatter and gather the temperature values at the degrees of freedom owned by the neighboring computational nodes.

Figure 2: Discretization of the domain (grid with smaller spacing) and SDs (grid with larger spacing, note the thicker lines). Here, we consider 4 compute nodes and 25 square SDs. Color indicates the owner of the SDs. A typical SD consists of 4 × 4 DPs. To update an SD, the data from the neighboring SDs within the $\epsilon$ distance is required. If the size of an SD is bigger than $\epsilon$, any SD needs to communicate only with its immediate neighboring SDs.

• Sub-problem (SP) – the region of the material domain managed by a computational node; i.e., the computational node is the owner of the degrees of freedom in that region and computes the temperature following scheme (5). For example, in Figure 2, the regions colored green, yellow, cyan, and red show four SPs.
• Sub-domain (SD) – a region of the material domain that is processed and computed independently within a given computational node. SDs are subsets of the SP on a given node; i.e., an SP is further divided into multiple SDs. In this work, we always consider square sub-domains. By the size of an SP, we mean the number of SDs it consists of. In Figure 2, the SDs are the cells of the grid with larger spacing (see thick lines); for instance, node 1 owns 6 SDs.
• Discretized point (DP) – each SD is responsible for the discrete points within it. To be able to apply (5) and compute the temperature at the discrete points in a given SD$_i$, SD$_i$ has to exchange data with the neighboring SDs. When all the neighboring SDs of SD$_i$ are on the same computational node as SD$_i$, no special data exchange method is required. However, when some SDs are on neighboring computational nodes, a proper communication method is required. For instance, in Figure 2, the green SD (owned by node 1) on top, near the yellow region, depends on the data owned by node 2.
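The hierarchy of these terms can be summarized by a small data-structure sketch; the names and fields below are illustrative, not taken from the implementation:

    #include <vector>

    struct DP { double temperature; };      // one discretized point
    struct SD {                             // square sub-domain: a unit of work
        int id;                             // global sub-domain id
        std::vector<DP> points;             // DPs this SD is responsible for
        std::vector<int> neighbors;         // ids of SDs within the eps distance
    };
    struct SP {                             // sub-problem owned by one node
        std::vector<SD> sds;                // the SP is a set of SDs
    };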
Related work

Asynchronous many-task systems (AMT): Many asynchronous many-task (AMT) systems have been developed recently. However, we focus on those with distributed capabilities, since the current work focuses on domain decomposition and load balancing for the distributed case. Here, the most notable are: Uintah [10], Chapel [11], Charm++ [12], Kokkos [13], Legion [14], and PaRSEC [15]. According to [16], only HPX has a C++ standard compatible API and the highest technology readiness level.
Load balancing:
To the best of our knowledge, there are no load balancing algorithms specifically designed to handle the constraints of minimizing the data dependencies across multiple computational nodes in distributed solvers for nonlocal models. The authors of [17] proposed a load balancing algorithm in the Charm++ library, which shares its underlying implementation concepts with HPX; their hardware-topology-aware load balancing minimizes the application communication costs. In the slightly different context of single-node multi-core systems, [18] proposes to use the ideas of work-stealing and dynamic coarsening to improve the data locality on multiple cores while load balancing.
Key features of HPX
Figure 3: Sketch of HPX's architecture and its components: Local Control Objects (LCOs), the threading subsystem, networking, performance counters, and the Active Global Address Space (AGAS). We briefly introduce the performance counter and local control object components in this section. For all other components, we refer to [9]. Adapted from [9].
Figure 3 sketches HPX's architecture and its components. We briefly introduce the local control objects and the performance counters that we use in our implementation. For all other components, we refer to [9]. The performance counters component provides a uniform API of globally accessible counters to access the performance in situ [19]. All counters are registered within the Active Global Address Space (AGAS) [20] so that they can be polled during the run time of the application.

The Local Control Objects (LCOs) component provides the features hpx::async and hpx::future for asynchronous execution and synchronization. Listing 1 shows how to compute a + b + c + d asynchronously using HPX. In Line 2, the function to add two integers is defined. In Lines 6–7, the function add is launched asynchronously using hpx::async, which means the function is executed on another thread while the call immediately returns an hpx::future that can later be used to retrieve the result and synchronize.
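The listing itself did not survive extraction; below is a minimal reconstruction consistent with the description above (the exact line numbering of the original listing may differ slightly):

    // Listing 1 (reconstructed sketch): asynchronous computation of a + b + c + d.
    #include <hpx/hpx_main.hpp>
    #include <hpx/hpx.hpp>
    #include <iostream>

    int add(int a, int b) { return a + b; }

    int main() {
        // Launch two additions asynchronously; each call returns immediately
        // with an hpx::future<int> while add runs on an HPX thread.
        hpx::future<int> f1 = hpx::async(add, 1, 2);   // a + b
        hpx::future<int> f2 = hpx::async(add, 3, 4);   // c + d
        // .get() synchronizes: it waits for each future's result.
        std::cout << add(f1.get(), f2.get()) << std::endl;  // prints 10
        return 0;
    }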
Proposed methodology
Figure 4: Sketch of the proposed distributed framework (on right) and the decomposition and computation problem flow in a distributed cluster (on left).
Right: For the decomposition of the domain and the distribution of the problems across computational nodes, the METIS library is utilized. Computation independent of foreign data (foreign data is data that is not available on the current node) is done while waiting for the data to become available from the neighboring nodes. Foreign-data-dependent computation is done once the data from the neighboring nodes is available. Our load balancing algorithm ensures that all the computational nodes have an SP (sub-problem) size in proportion to their computational power.
Left: The problem is first decomposed into multiple SPs distributed to nodes (colors show the node responsible for each SP). In the second step (see the third figure from the top), while waiting for data from the neighboring nodes, we perform computation on those SDs (sub-domains) which do not depend on the data of neighboring nodes; the dark-colored squares indicate that computation is being performed on the DPs (discretized points) within them. In the third step (see the fourth figure from the top), we perform computation on those regions of the SP that depend on the neighboring nodes' data; the dark-colored squares are now different. Finally, at the end of the timestep, we check the load on each compute node and, if needed, redistribute the SDs (i.e., change the SP of individual nodes) by applying the load balancing step discussed in Section 7.

To move from a serial (single computational node and single-threaded) implementation to a fully distributed and load-balanced implementation, we proceeded in three steps. First, we implemented a single-threaded version. Second, we extended the serial implementation to a multi-threaded version using asynchronous execution, e.g., futurization. Third, we implemented a fully distributed asynchronous solver for (5). In this extension, we rely on HPX's implementation of distributed asynchronous computation. We show the schematics of the fully distributed and load-balanced time-stepping scheme for the nonlocal model (5) in Figure 4. In Figure 4 (right), the various components that tackle the challenges listed in Section 3, and their interactions, are shown. We now discuss our approach to the challenges described in Section 3.
Mesh partitioning: We propose a simple decomposition of the complete square domain into smaller squares (SDs). As shown in Figure 2, we divide the discretized square domain into 5 × 5 SDs, i.e., a total of 25 SDs. Each SD consists of 4 × 4 DPs and is responsible for the computation associated with these DPs. We consider the computation in an SD as a unit of work; the size of an SD controls both the communication burden and the size of the work unit (a larger SD has more DPs and hence more computation), and therefore the size of an SD can be tuned to achieve maximum performance.
Figure 5: Different cases of computation in one SD (green square). The neighboring SDs owned by different nodes are also shown. The DPs in the green SD can be divided into two parts: 1) DPs to the right of or below the lines marked L (see the typical point x_i near label C1), and 2) the remaining DPs in the green SD (see the typical point x_i near label C2). At timestep k, we can perform the computation on the DPs of case 2 independently, whereas for DPs associated with case 1 the computation depends on the data from neighboring nodes.

Minimizing data exchange: The efficiency of a distributed solver is limited by the data dependencies across the computational nodes. For instance, in Figure 2, the SDs belonging to one computational node depend on the data of neighboring SDs belonging to other computational nodes. We propose to use the METIS library [7] to address this challenge. METIS is a set of serial programs for fast and high-quality partitioning of finite element meshes and graphs. Specifically, we use the METIS_PartMeshDual function, which ensures that the resulting partition is optimal and results in minimum data exchange across the computational nodes during the execution. For instance, we use
METIS_PartMeshDual to distribute the 25 SDs across 4 nodes as shown in Figure 2. The contiguous partitioning ensures that most of the bordering elements exchange data with SDs belonging to the same computational node, thus reducing the data exchange.
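As an illustration of this step, the following hedged sketch partitions a 5 × 5 grid of SDs across 4 nodes with METIS_PartMeshDual, describing each square SD as a 4-node element of a quad mesh; the setup code is ours, and only the METIS call reflects the library's API:

    #include <metis.h>
    #include <cstdio>
    #include <vector>

    int main() {
        idx_t gx = 5, gy = 5;            // 5 x 5 SDs, as in Figure 2
        idx_t ne = gx * gy;              // number of elements (SDs)
        idx_t nn = (gx + 1) * (gy + 1);  // number of corner nodes of the SD grid
        std::vector<idx_t> eptr(ne + 1), eind(4 * ne);
        for (idx_t e = 0; e < ne; ++e) { // CSR element->node connectivity
            idx_t i = e % gx, j = e / gx;
            idx_t c = j * (gx + 1) + i;  // lower-left corner node of SD e
            eptr[e] = 4 * e;
            eind[4 * e + 0] = c;          eind[4 * e + 1] = c + 1;
            eind[4 * e + 2] = c + gx + 2; eind[4 * e + 3] = c + gx + 1;
        }
        eptr[ne] = 4 * ne;
        idx_t ncommon = 2;               // sharing an edge => dual-graph neighbors
        idx_t nparts = 4, objval = 0;
        std::vector<idx_t> epart(ne), npart(nn);
        METIS_PartMeshDual(&ne, &nn, eptr.data(), eind.data(),
                           nullptr, nullptr, &ncommon, &nparts,
                           nullptr, nullptr, &objval, epart.data(), npart.data());
        for (idx_t e = 0; e < ne; ++e)   // epart[e] is the owning node of SD e
            std::printf("SD %d -> node %d\n", (int)e, (int)epart[e]);
    }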
Hiding data exchange time: We propose to divide the set of DPs of any SD into two cases (see also Figure 5):
• Case 1 consists of DPs that depend on the data from SDs of other nodes. Consider a point x_i near label C1 in Figure 5; the ε neighborhood of x_i contains some DPs of SDs on other computational nodes. At every timestep, to compute the right-hand side of (5), the updated temperatures at DPs such as x_j of the neighboring SDs must be collected.

• Case 2 consists of the remaining DPs. The computation associated with these DPs is independent of the DPs belonging to other computational nodes.

At every timestep, we propose to compute the data for the DPs belonging to Case 2 first, while the data for the DPs belonging to Case 1 becomes available. This ordering ensures an effective hiding of the data sharing time for the points corresponding to Case 1.
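A hedged sketch of how this two-phase update can be expressed with HPX futurization follows; compute_interior, compute_boundary, and receive_ghosts are hypothetical helpers (with trivial stub bodies here), not the paper's actual API:

    #include <hpx/hpx_main.hpp>
    #include <hpx/hpx.hpp>

    struct SD {};                                  // sub-domain (placeholder)
    struct GhostData {};                           // temperatures from neighboring nodes

    // Hypothetical helpers with stub bodies:
    void compute_interior(SD&) {}                  // update Case 2 DPs
    void compute_boundary(SD&, GhostData const&) {} // update Case 1 DPs
    hpx::future<GhostData> receive_ghosts(SD&) {   // async receive of ghost-zone data
        return hpx::make_ready_future(GhostData{});
    }

    hpx::future<void> advance_sd(SD& sd) {
        // Case 2: interior DPs need no foreign data; start computing immediately.
        hpx::future<void> interior = hpx::async([&sd] { compute_interior(sd); });
        // Case 1: boundary DPs run as a continuation once the ghost data arrives.
        hpx::future<void> boundary = receive_ghosts(sd).then(
            [&sd](hpx::future<GhostData> g) { compute_boundary(sd, g.get()); });
        // The SD's timestep completes when both phases are done.
        return hpx::when_all(std::move(interior), std::move(boundary))
            .then([](auto&&) {});
    }

    int main() {
        SD sd;
        advance_sd(sd).get();                      // one timestep for one SD
        return 0;
    }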
Load balancing algorithm

The issue of load balancing often becomes crucial in nonlocal models, especially in nonlocal fracture models [1]. In fracture/crack problems, the computation in regions containing the crack differs from that in regions not containing it; the crack line (in 2d, or surface in 3d) divides the two regions such that the points on either side of the crack line do not interact with each other. In our terminology, the SDs which contain a portion of the crack have a reduced computational burden compared to the SDs not containing the crack. Another instance where load balancing can be crucial is when the computational capacity of a node is time-dependent. To the best of our knowledge, there are no well-known load balancing algorithms for nonlocal models in AMT systems.
Figure 6: Redistribution of the sub-domains (SDs) among the computational nodes to balance the load, in topological order. Node 1 borrows data from the non-visited node 2, followed by nodes 4 and 3. Each node borrows SDs uniformly in all directions to retain a contiguous locality of the SDs.

In this section, we propose a novel load balancing algorithm useful for the cases discussed above. The only assumption we make is that the data exchange times are negligible compared to the amount of time spent on the computation. Our method of hiding the data exchange times in Subsection 6.3 makes this assumption realistic and reasonable. We now discuss the key aspects of the load balancing algorithm.
Calculating the Load Imbalance
To measure the load imbalance, we use hpx::performance_counters::busy_time, which reports the fraction of time the node was busy doing computation relative to the total time the node was active. Between successive load balancing iterations, the performance counter is reset so that all the computational nodes in the cluster are measured over the same total time span. Significantly different busy times for different computational nodes indicate a load imbalance; in the ideal case, the busy time is the same for all nodes. To achieve a close-to-perfect load-balanced state, it is necessary to assign SDs to the individual nodes based on their compute capacity. To measure the compute capacity of a particular computational node $N_i$, we use the formula

$$\text{Power}(N_i) = \frac{\overline{\text{SD}}(N_i)}{\text{BusyTime}(N_i)}, \qquad (8)$$

where $\overline{\text{SD}}(N_i)$ denotes the number of SDs on node $N_i$. Clearly, a more powerful node can handle a larger number of SDs, so Power can be seen as an accurate measure of the compute capacity. We calculate the load imbalance in terms of SDs using the formula

$$\text{LoadImbalance}(N_i) = E(N_i) - \overline{\text{SD}}(N_i), \qquad (9)$$

where $E(N_i)$ is given by

$$E(N_i) = \text{Total no. of SDs} \times \frac{\text{Power}(N_i)}{\sum_j \text{Power}(N_j)}. \qquad (10)$$
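For concreteness, a small sketch of (8)–(10) in code (the inputs are illustrative; in the actual solver, the busy times would come from the HPX performance counters):

    #include <cstdio>
    #include <numeric>
    #include <vector>

    // Compute per-node load imbalance from SD counts and measured busy times.
    std::vector<double> load_imbalance(std::vector<double> const& numSD,
                                       std::vector<double> const& busyTime) {
        std::size_t n = numSD.size();
        std::vector<double> power(n), imbalance(n);
        for (std::size_t i = 0; i < n; ++i)
            power[i] = numSD[i] / busyTime[i];                  // (8)
        double totalSD = std::accumulate(numSD.begin(), numSD.end(), 0.0);
        double totalPower = std::accumulate(power.begin(), power.end(), 0.0);
        for (std::size_t i = 0; i < n; ++i)
            imbalance[i] = totalSD * power[i] / totalPower      // E(N_i), (10)
                           - numSD[i];                          // (9)
        return imbalance;
    }

    int main() {
        // Example: node 0 is twice as fast (same SD count, half the busy time).
        auto imb = load_imbalance({6, 6, 6, 7}, {0.5, 1.0, 1.0, 1.0});
        for (double v : imb) std::printf("%+.2f ", v);  // positive: under-loaded
    }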
Balancing the Load

We want to distribute the total work, i.e., all the SDs, keeping in mind the compute capacity, see (9), of the individual nodes. Towards this, we propose our novel load balancing algorithm; the main idea is to borrow/lend SDs from/to a computational node that owns the adjacent SDs. For instance, in Figure 6 (top-left), computational node 1 borrows SDs from node 2 to reduce the compute burden on node 2. Note that the borrowed SDs share their boundaries with the SDs of node 2. This helps in preserving a contiguous locality of SDs, thereby minimizing the data exchange across computational nodes.

The load balancing algorithm is presented briefly in Algorithm 1. We calculate the load imbalance for each computational node by (9) in Lines 2–12. In Lines 13–18, we model the data dependencies across computational nodes using a tree T.
Figure 7: Data exchange among different pairs of nodes of the data-dependency tree, in topological order. The topological ordering generated for the figure above is 1 → 4 → 3 → 2. Hence, node 1 borrows SDs from its neighbouring node 2 to balance the load; similarly, node 4 and then node 3 borrow SDs from their neighbouring nodes.

Each node $N_i$ in the tree has a one-to-one correspondence with a computational node, and an edge $e$ between two nodes of the tree denotes a data dependency between them. An edge between nodes $N_i$ and $N_j$ can exist only if one of the nodes owns an SD that has data dependencies with SDs of the other node. For instance, the scenario in Figure 6 (top-left) translates to the tree shown in Figure 7 (top-left). Note that multiple such trees are possible; we use one of them. In Line 19 of Algorithm 1, we obtain an ordering of the nodes to perform the data redistribution in a least-data-dependency-first fashion. The topological ordering used for the tree in Figure 7 is 1 → 4 → 3 → 2.
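One simple way to realize get_topological_sort for the rooted dependency tree is a breadth-first traversal, which orders every node before its children; this sketch (with illustrative names) is our reading of the step, not the paper's code:

    #include <queue>
    #include <vector>

    // Return a processing order of the dependency tree T (adjacency lists),
    // starting at the root; parents are always ordered before their children.
    std::vector<int> topological_order(int root,
                                       std::vector<std::vector<int>> const& tree) {
        std::vector<int> order;
        std::vector<bool> visited(tree.size(), false);
        std::queue<int> q;
        q.push(root);
        visited[root] = true;
        while (!q.empty()) {
            int n = q.front(); q.pop();
            order.push_back(n);
            for (int m : tree[n])           // each node then borrows/lends SDs
                if (!visited[m]) {          // only with its non-visited neighbors
                    visited[m] = true;
                    q.push(m);
                }
        }
        return order;
    }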
Coming back to the algorithm, in Lines 21–34 we perform the actual data redistribution among the nodes in topological order. The topological ordering helps to borrow/lend SDs between computational nodes optimally, without unbalancing the already balanced nodes. Figure 7 shows the redistribution of the SDs among the tree nodes in topological order, and Figure 6 shows the corresponding change in the SD distribution across nodes. Note that each node $N_i$ borrows/lends SDs from/to all the non-visited adjacent nodes $N_j$, uniformly in all the spatial directions in the domain. METIS generates an optimal distribution of SDs across computational nodes for minimum data exchange, and it is important to preserve the shape of the SPs obtained using the METIS mesh partitioning in Subsection 6.2 to obtain the best scaling. For instance, node 1 in Figure 6 borrows SDs of node 2 from all spatial directions to minimize the data dependencies. At the end of the load balancing iteration, we reset the performance counters to obtain correct performance measurements for the new load distribution in the next load balancing step; see Line 35.

Numerical results

We first validate the solver by considering a test setting and comparing the numerical solution with the analytical solution; we refer to Subsection 2.2 for the details, where the specific form of the external heat source b is chosen so that the model has an analytical solution. We compute the error due to the numerical discretization following (7); the error depends on the timestep size and the mesh size and should decrease as both decrease. From Figure 8 we see that the numerical error decreases with decreasing mesh size, and this serves as a validation of the serial and distributed solvers.
Algorithm 1: Load Balancing Algorithm
1:  ▷ Let N be the number of computational nodes
2:  ▷ Compute the number of SDs (sub-domains) on each node
3:  compute NumSubDomains[i] for i ∈ [0, N−1]
4:  ▷ Compute the computational power of each node using (8)
5:  compute Power[i] for i ∈ [0, N−1]
6:  ▷ Compute the expected number of SDs using (10) and Power
7:  compute ExpSubdomains[i] for i ∈ [0, N−1]
8:  ▷ Compute the load imbalance using (9); a positive value indicates the load on the node is smaller, negative otherwise
9:  for each integer i ∈ [0, N−1] do
10:    LoadImbalance[i] = ExpSubdomains[i]
11:                       − NumSubDomains[i]
12: end for
13: ▷ Set the node with minimum imbalanced load as root R of tree T
14: Root R = argmin(LoadImbalance)
15: ▷ Each computational node is represented by some node in the tree T
16: ▷ An edge e between two nodes exists if there is an SD that is in one of the nodes and is adjacent to the SP (sub-problem, set of SDs) of the other node
17: Tree T = construct_tree(R)
18: ▷ Given root R and tree T, get_topological_sort(R, T) returns the node ids to be processed in the next step
19: orderedNodes[N] = get_topological_sort(R, T)
20: ▷ Each node borrows SDs only from non-visited nodes, to increase its territory (set of SDs) uniformly in all directions
21: for each node i ∈ orderedNodes[0, N−1] do
22:   if LoadImbalance[i] == 0 then
23:     continue
24:   end if
25:   ▷ Compute the list of non-visited adjacent nodes of node i
26:   ▷ Suppose AdjacentNonVisitedNodes contains L nodes
27:   compute AdjacentNonVisitedNodes
28:   ▷ Number of SDs to borrow from each adjacent node
29:   XchngNum = LoadImbalance[i] / L
30:   for each node m ∈ AdjacentNonVisitedNodes(i) do
31:     LoadImbalance[m] −= XchngNum
32:   end for
33:   LoadImbalance[i] = 0
34: end for
35: reset_all(hpx::performance_counters::busy_time)
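In code, the redistribution loop (Lines 21–34) might look like the following hedged sketch; the adjacency bookkeeping and all names are ours:

    #include <vector>

    // Mirror of Algorithm 1, Lines 21-34: walk the nodes in topological order
    // and settle each node's imbalance against its not-yet-visited neighbors.
    void redistribute(std::vector<double>& imbalance,
                      std::vector<int> const& orderedNodes,
                      std::vector<std::vector<int>> const& adjacent) {
        std::vector<bool> visited(imbalance.size(), false);
        for (int i : orderedNodes) {
            visited[i] = true;
            if (imbalance[i] == 0) continue;
            std::vector<int> partners;            // non-visited adjacent nodes
            for (int m : adjacent[i])
                if (!visited[m]) partners.push_back(m);
            if (partners.empty()) continue;
            // Number of SDs to borrow from (or lend to) each partner.
            double xchng = imbalance[i] / static_cast<double>(partners.size());
            for (int m : partners) imbalance[m] -= xchng;
            imbalance[i] = 0;                     // node i is now balanced
        }
    }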
We now study the speedup based on instruction-level parallelism using multiple threads. The main idea is to distribute the SDs uniformly among multiple threads located on the same computational node while sharing a common data structure. At every timestep, each thread is responsible for computing the temperature and updating the common data structure for its allocated SDs. We study the speedup for 1, 2, and 4 CPUs for fixed and variable problem sizes. The single-CPU execution time is the baseline for the speedup plots.
Strong scaling of asynchronous 2d nonlocal equation
In Figure 9, we present the scaling results for the asynchronous implementation of (5) for a fixed problem size. We consider a fixed 400 × 400 mesh and measure the effect of the decomposition into SDs of different sizes. We consider four cases: 1) 1 × 1, i.e., the entire square domain; 2) 2 × 2, i.e., the entire square domain is divided into a total of 4 partitions (2 along each axis); 3) 4 × 4; and 4) 8 × 8. The strong scaling plot in Figure 9 indicates a linear dependence of the speedup on the number of CPUs.
Weak scaling of asynchronous 2d nonlocal equation
In Figure 10, we present the scaling of the asynchronous implementation of (5) with variable problem size. We study the effect of increasing the mesh size by increasing the number of SDs of fixed size 50 × 50 along the X and Y axes. The eight problem sizes that we considered are illustrated by the following examples: 1 × 1, i.e., the total problem size is 50 × 50; 2 × 2, i.e., the total problem size is 100 × 100; 3 × 3, i.e., the total problem size is 150 × 150; and so on up to 8 × 8, i.e., the total problem size is 400 × 400.
Figure 8: Plot of the total error $e = \sum_k e_k$ (see (7)) for different mesh sizes $h = 1/2^n$, $n = 2, 3, \ldots$
Figure 9: Strong scaling results of the asynchronous solver for mesh size = 400 × 400, with ε = 8h and number of timesteps = 20 (i.e., N = 20 in NΔt = T). The entire mesh of size 400 × 400 is divided into equal-sized SDs. The size of the SDs is varied to keep the total mesh size constant; e.g., for 16 SDs, the partitioning is 4 × 4 (along the X and Y directions), with each SD of size 100 × 100 (i.e., 100 DPs along the X and Y directions in each SD).

We now study the speedup based on data-level parallelism using multiple computational nodes. We distribute the SDs uniformly across multiple computational nodes; each computational node is responsible for computing the temperature corresponding to its SDs. In this process, the computational nodes may exchange data to satisfy the nonlocal dependencies. In our experiments, we study the speedup for 1, 2, and 4 computational nodes for fixed and variable problem sizes. The single-node execution time is the baseline for the speedup plots. The distribution of SDs across 1, 2, and 4 computational nodes is as follows. 1 node: the entire square domain is on a single node. 2 nodes: the entire square domain is divided into 2 equal-sized halves; for the number of SDs = 4 × 4, we divide the square domain into two halves of 2 × 4 SDs each, with each half assigned to a different computational node. 4 nodes: the entire square domain is divided into 4 equal-sized squares, each assigned to a distinct computational node.
Figure 10: Weak scaling results of the asynchronous solver for SD size = 50 × 50 (i.e., 50 DPs along the X and Y directions), ε = 8h, and number of timesteps = 20 (i.e., N = 20 in NΔt = T). The number of SDs is increased along both the X and Y directions, keeping the size of the SDs constant. The total mesh size is 50n × 50n, where n is the number of SDs along each direction.
Figure 11: Strong scaling results of the distributed solver for mesh size = 400 × 400, with ε = 8h and number of timesteps = 20 (i.e., N = 20 in NΔt = T). The entire mesh of size 400 × 400 is divided into equal-sized SDs. The size of the SDs is varied to keep the total mesh size constant; e.g., for 16 SDs, the partitioning is 4 × 4 (along the X and Y directions), with each SD of size 100 × 100 (i.e., 100 DPs along the X and Y directions in each SD).

Strong scaling of the distributed 2d nonlocal equation
In Figure 11, we present the scaling of the distributed implementation of (5) for a fixed problem size. We study the effect of decomposing a mesh of fixed size 400 × 400 into different numbers of SDs to demonstrate the speedup. We consider four cases: 1) 1 × 1, i.e., the entire square domain; 2) 2 × 2, i.e., the entire square domain is divided into a total of 4 partitions (2 along each axis); 3) 4 × 4; and 4) 8 × 8. The strong scaling plot in Figure 11 indicates a linear dependence of the speedup on the number of computational nodes.
Weak scaling of the distributed 2d nonlocal equation
In Figure 12, we present the scaling of the distributed implementation of (5) with variable problem size. We study the effect of increasing the mesh size by increasing the number of SDs, where each SD is of fixed size 50 × 50. The eight problem sizes that we considered are illustrated by the following examples: 1 × 1, i.e., the total problem size is 50 × 50; 2 × 2, i.e., the total problem size is 100 × 100; 3 × 3; and so on up to 8 × 8.
Figure 12: Weak scaling results of the distributed solver for SD size = 50 × 50 (i.e., 50 DPs along the X and Y directions), ε = 8h, and number of timesteps = 20 (i.e., N = 20 in NΔt = T). The number of SDs is increased along both the X and Y directions, keeping the size of the SDs constant. The total mesh size is 50n × 50n, where n is the number of SDs along each direction. The distribution of SDs across the computational nodes is done using the METIS library.
Figure 13: Distributed scaling results of the distributed solver for mesh size = 800 × 800, with SDs of size 50 × 50 (i.e., 50 DPs along the X and Y directions), ε = 8h, and number of timesteps = 20 (i.e., N = 20 in NΔt = T). The total number of SDs is 16 × 16 (i.e., 16 SDs along the X and Y directions). The distribution of SDs across the varying number of computational nodes is done using the METIS library. The measured speedup is plotted against the optimal speedup.
The weak scaling plot in Figure 12 indicates a linear dependence of the speedup on the number of computational nodes, irrespective of the problem size.
Distributed scaling using METIS for mesh partitioning
We now study the scaling of the distributed solver where we keep the size of the problem fixed (i.e., the mesh size is fixed) and increase the number of computational nodes; see the plot of the speedup versus the number of nodes in Figure 13. We consider a fixed 800 × 800 mesh (800 grid points in the X and Y directions); from this mesh we generate 16 × 16 square SDs, with each SD consisting of 50 × 50 grid points. We then use the METIS library to distribute the SDs across multiple computational nodes. The advantages of distributing the SDs instead of the original fine mesh are as follows: the partitioning using METIS is very fast, since the number of SDs is much smaller than the number of grid points in the original mesh; the I/O time is reduced, since we only need to read and write the SD allocation information; and, lastly, the SDs owned by a node can further be distributed across multiple threads within that node for parallel computation.
Figure 14: Redistribution of 5 × 5 SDs among the computational nodes 1, 2, 3, 4. Starting from a very imbalanced load distribution (left), within 3 iterations the load balancing algorithm is able to achieve a well-balanced distribution across the 4 computational nodes (right).

The distributed scaling plot in Figure 13 indicates a linear relationship between the speedup and the number of computational nodes. As the number of computational nodes increases, the number of boundary SDs requiring data exchange increases. This increase in the data exchange leads to a slight deviation from the straight line.
Validation of the load balancing algorithm
To verify the load balancing algorithm in Algorithm 1, we consider 4 symmetric nodes and assign a highly imbalanced load to each of them at the beginning. We demonstrate in Figure 14 that within 3 iterations, the load balancing algorithm is able to redistribute the SDs among the nodes to achieve a nearly balanced load distribution.
Conclusion

We proposed the ideas of (i) coarsening the mesh for higher granularity, (ii) efficient mesh partitioning using METIS, and (iii) hiding the data exchange time using asynchronous computation, and we demonstrated good weak and strong scaling for the distributed implementation. We presented a novel load balancing algorithm, using HPX to schedule the tasks asynchronously (using hpx::future and hpx::async) and local control objects for synchronization. The major contribution is the novel load balancing algorithm that utilizes HPX's performance counters to address the specific challenges of nonlocal models. We used SDs (sub-domains) as the unit of exchange to keep the modelling of the data exchange simple and to preserve data locality. We proposed to preserve the contiguous locality obtained by the METIS mesh partitioning by borrowing SDs uniformly in all spatial directions; the redistributed load after the load balancing step thus minimizes the data exchange time. In one example, we showed that the proposed algorithm balanced a largely unbalanced domain within three iterations. For future work, we intend to investigate the addition of specific performance counters and networking counters to the load balancer. From the geometry perspective, more complex non-square domains and three dimensions will be investigated. From the model perspective, more complex models, e.g., nonlocal mechanics [1, 2], will be investigated.

Acknowledgements
We are grateful for the support of the Google Summer of Code program which funded P. Gadikar’s summer internship. preprint - February 9, 2021 Supplementary materials
The source code to reproduce the results is available on GitHub: https://github.com/nonlocalmodels/nonlocalheatequation

References

[1] P. Diehl, S. Prudhomme, and M. Lévesque, "A review of benchmark experiments for the validation of peridynamics models," Journal of Peridynamics and Nonlocal Modeling, vol. 1, no. 1, pp. 14–35, 2019.
[2] P. K. Jha and R. P. Lipton, "Kinetic relations and local energy balance for LEFM from a nonlocal peridynamic model," International Journal of Fracture, vol. 226, no. 1, pp. 81–95, 2020.
[3] N. Burch and R. Lehoucq, "Classical, nonlocal, and fractional diffusion equations on bounded domains," International Journal for Multiscale Computational Engineering, vol. 9, no. 6, 2011.
[4] N. J. Armstrong, K. J. Painter, and J. A. Sherratt, "A continuum approach to modelling cell–cell adhesion," Journal of Theoretical Biology, vol. 243, no. 1, pp. 98–113, 2006.
[5] C. Engwer, C. Stinner, and C. Surulescu, "On a structured multiscale model for acid-mediated tumor invasion: the effects of adhesion and proliferation," Mathematical Models and Methods in Applied Sciences, vol. 27, no. 07, pp. 1355–1390, 2017.
[6] P. K. Jha, P. S. Desai, and R. Lipton, "Peridynamics-based discrete element method (PeriDEM) model of granular systems involving breakage of arbitrarily shaped particles," arXiv preprint arXiv:2010.07218, 2020.
[7] G. Karypis and V. Kumar, "A fast and high quality multilevel scheme for partitioning irregular graphs," SIAM Journal on Scientific Computing, vol. 20, no. 1, pp. 359–392, 1998.
[8] T. Heller, P. Diehl, Z. Byerly, J. Biddiscombe, and H. Kaiser, "HPX – An open source C++ Standard Library for Parallelism and Concurrency," in Proceedings of OpenSuCo 2017, Denver, Colorado, USA, November 2017 (OpenSuCo'17), 2017, p. 5.
[9] H. Kaiser, P. Diehl, A. S. Lemoine, B. A. Lelbach, P. Amini, A. Berge, J. Biddiscombe, S. R. Brandt, N. Gupta, T. Heller, K. Huck, Z. Khatami, A. Kheirkhahan, A. Reverdell, S. Shirzad, M. Simberg, B. Wagle, W. Wei, and T. Zhang, "HPX – the C++ standard library for parallelism and concurrency," Journal of Open Source Software, vol. 5, no. 53, p. 2352, 2020.
[10] J. D. de St. Germain, J. McCorquodale, S. G. Parker, and C. R. Johnson, "Uintah: A massively parallel problem solving environment," in Proceedings of the Ninth International Symposium on High-Performance Distributed Computing. IEEE, 2000, pp. 33–41.
[11] B. L. Chamberlain, D. Callahan, and H. P. Zima, "Parallel programmability and the Chapel language," The International Journal of High Performance Computing Applications, vol. 21, no. 3, pp. 291–312, 2007.
[12] L. V. Kale and S. Krishnan, "CHARM++: A portable concurrent object oriented system based on C++," in Proceedings of the Eighth Annual Conference on Object-Oriented Programming Systems, Languages, and Applications, 1993, pp. 91–108.
[13] H. C. Edwards, C. R. Trott, and D. Sunderland, "Kokkos: Enabling manycore performance portability through polymorphic memory access patterns," Journal of Parallel and Distributed Computing, vol. 74, no. 12, pp. 3202–3216, 2014.
[14] M. Bauer, S. Treichler, E. Slaughter, and A. Aiken, "Legion: Expressing locality and independence with logical regions," in SC'12: Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis. IEEE, 2012, pp. 1–11.
[15] G. Bosilca, A. Bouteiller, A. Danalis, M. Faverge, T. Hérault, and J. J. Dongarra, "PaRSEC: Exploiting heterogeneity to enhance scalability," Computing in Science & Engineering, vol. 15, no. 6, pp. 36–45, 2013.
[16] P. Thoman, K. Dichev, T. Heller, R. Iakymchuk, X. Aguilar, K. Hasanov, P. Gschwandtner, P. Lemarinier, S. Markidis, H. Jordan et al., "A taxonomy of task-based parallel programming technologies for high-performance computing," The Journal of Supercomputing, vol. 74, no. 4, pp. 1422–1434, 2018.
[17] E. Jeannot, E. Meneses, G. Mercier, F. Tessier, and G. Zheng, "Communication and topology-aware load balancing in Charm++ with TreeMatch," 2013, pp. 1–8.
[18] J. Lifflander, S. Krishnamoorthy, and L. V. Kale, "Optimizing data locality for fork/join programs using constrained work stealing," in SC14, 2014, pp. 857–868.
[19] P. A. Grubel, Dynamic Adaptation in HPX: A Task-based Parallel Runtime System. New Mexico State University, 2016.
[20] P. Amini and H. Kaiser, "Assessing the performance impact of using an active global address space in HPX: A case for AGAS," IEEE, 2019, pp. 26–33.