A matrix-free approach for finite-strain hyperelastic problems using geometric multigrid
Denis Davydov, Jean-Paul Pelteret, Daniel Arndt, Paul Steinmann
Denis Davydov^{a,∗}, Jean-Paul Pelteret^a, Daniel Arndt^b, Paul Steinmann^{a,c}

^a Chair of Applied Mechanics, Friedrich-Alexander-Universität Erlangen-Nürnberg, Egerlandstr. 5, 91058 Erlangen, Germany
^b Interdisciplinary Center for Scientific Computing (IWR), Heidelberg University, Im Neuenheimer Feld 205, 69120 Heidelberg, Germany
^c Glasgow Computational Engineering Center (GCEC), University of Glasgow, G12 8QQ Glasgow, United Kingdom
Abstract
The performance of finite element solvers on modern computer architectures is typically memory bound for sufficiently large problems. The main cause for this is that loading matrix elements from RAM into CPU cache is significantly slower than performing the arithmetic operations when solving the problem. In order to improve the performance of iterative solvers within the high-performance computing context, so-called matrix-free methods, where matrix-vector products are computed on-the-fly, are widely adopted in the fluid mechanics community. To date, there have been few (if any) assessments of the applicability of the matrix-free approach to problems in solid mechanics. In this work, we perform an initial investigation on the application of the matrix-free approach to problems in quasi-static finite-strain hyperelasticity to determine whether it is viable for further extension. Specifically, we study different numerical implementations of the finite element tangent operator, and determine whether generalized methods of incorporating complex constitutive behavior might be feasible. In order to improve the convergence behavior of iterative solvers, we also propose a method by which to construct level tangent operators and employ them to define a geometric multigrid preconditioner. The performance of the matrix-free operator and the geometric multigrid preconditioner is compared to the matrix-based implementation with an algebraic multigrid preconditioner on a single node for a representative numerical example of a heterogeneous hyperelastic material in two and three dimensions. We conclude that the application of matrix-free methods to finite-strain solid mechanics is promising, and that it is possible to develop numerically efficient implementations that are independent of the hyperelastic constitutive law.

Keywords: adaptive finite element method, geometric multigrid, finite-strain, matrix-free, hyperelasticity

∗ Corresponding author.
Email addresses: [email protected] (Denis Davydov), [email protected] (Jean-Paul Pelteret), [email protected] (Daniel Arndt), [email protected] (Paul Steinmann)

Preprint submitted to CMAME May 1, 2019
1. Introduction
The performance of finite element solvers on modern computer architectures is typically memory bound for sufficiently large problems. The main cause for this is that loading pre-computed matrix elements from RAM into CPU cache is significantly slower than performing the arithmetic operations when solving the problem. In order to improve the performance of iterative solvers, so-called matrix-free methods, where matrix-vector products are formed on-the-fly [1–5], are widely adopted in the fluid mechanics community. Such methods can also be efficiently implemented on GPUs [6, 7].

To the best of our knowledge, matrix-free methods are not widely adopted within the solid mechanics community. We are only aware of a single paper currently in preparation that deals with small strain linear elasticity [8]. One possible reason is that the tangent operators (stemming from linearization of the non-linear balance equations) are more elaborate compared to the operators often used in both linear and non-linear fluid mechanics, and it is not obvious whether matrix-free methods can be advantageous in this case. Another reason could be the frequent use of lower-order elements in solid mechanics due to the lacking smoothness of the underlying solution. The finite element method (FEM) with linear or quadratic elements (that reduce the potential for locking), or mixed or enhanced formulations (that avoid locking), is often adopted in the solid mechanics community, whereas higher-order elements are rarely employed. This can potentially change, however: problems with very large numbers of unknowns inevitably require a matrix-free approach, as there is simply not enough memory to store the sparse tangent matrix [5]. Therefore, we believe that for large-scale computations in solid mechanics it is crucial to consider matrix-free approaches and develop efficient multigrid solvers.

In this work, we perform a preliminary investigation of different implementations of the finite-strain hyperelastic tangent operator with respect to their computational cost and performance on a single node of a high-performance computing cluster. The deal.II [11] finite element library, which provides MPI parallelization together with a generalized matrix-free framework that incorporates SIMD vectorization and an interface to high-performance, MPI-distributed matrix-based libraries, is used for this study. In order to improve the convergence of iterative solvers, we also propose a method by which to construct level matrix-free tangent operators and employ them to define a geometric multigrid preconditioner.

The paper is organized as follows: In Section 2 we briefly introduce the partial differential equations used in finite-strain solid mechanics. Sections 3 and 4 cover the finite element discretization and provide details of the numerical framework with which this study is implemented. In Section 5 we describe the different matrix-free operator implementations that are evaluated in this study, and in Section 6 we propose a geometric multigrid preconditioner that is suitable for the matrix-free implementation of finite-strain hyperelasticity. The performance of the matrix-free operator and the geometric multigrid preconditioner is compared to the matrix-based implementation with an algebraic multigrid preconditioner for a representative numerical example of a heterogeneous hyperelastic material simulated in two and three dimensions in Section 7. Lastly, the results are summarized in Section 8.

Footnote: Here and below, 'operator' denotes a finite-dimensional linear operator defined on the vector space R^N, where N is the number of unknown degrees of freedom in the FE discretization. Although any linear mapping on a finite-dimensional space is representable by a matrix, we adopt the 'operator' term to avoid confusion between its matrix-free and matrix-based numerical implementations.
2. Theoretical background
The deformation of a body from the referential configuration B to the spatial configuration B_t at time t is defined via the deformation map φ_t : B → B_t, which places material points of B into the Euclidean space E. The spatial location of a material particle X is given by x = φ_t(X). The displacement field reads u = x − X.

The tangent map is linear such that dx = F · dX, where F := Grad φ ≡ I + Grad u is called the deformation gradient and I is the second-order unit tensor. The mapping has to be one-to-one and must exclude self-penetration. Consequently, the Jacobian J = det F > 0. In what follows, we denote the push-forward of second-order and fourth-order referential tensors A and 𝔸 by a linear transformation χ defined as

    χ(A)_{ij} = F_{iA} A_{AB} F_{jB},   (1)
    χ(𝔸)_{ijkl} = F_{iA} F_{jB} 𝔸_{ABCD} F_{kC} F_{lD}.   (2)

Note that dx = χ(dX).

In what follows, we parameterize the material behavior using the right Cauchy-Green tensor C := F^T · F. We will also use the Green-Lagrange strain tensor E := ½[C − I] and the left Cauchy-Green tensor b := F · F^T. Clearly J = √(det C) and ∂(•)/∂E = 2 ∂(•)/∂C. For conservative systems without body forces, the total potential energy functional E is introduced as

    E = ∫_B ψ(F) dV − ∫_{∂B_N} T · u dS,   (3)

where T is the prescribed loading at the Neumann part of the boundary ∂B_N in the referential configuration, ψ denotes the strain energy per unit reference volume, and u should satisfy the prescribed Dirichlet boundary conditions u = ū on ∂B_D := ∂B \ ∂B_N ≠ ∅.

The principle of stationary potential energy at equilibrium requires that the directional derivative with respect to the displacement

    D_{δu} E := d/dε E(u + ε δu)|_{ε=0} = 0   ∀ δu   (4)

vanishes for all directions δu which satisfy homogeneous Dirichlet boundary conditions. This leads to the following scalar-valued non-linear equation

    F(u, δu) = ∫_B P : Grad δu dV − ∫_{∂B_N} T · δu dS = 0,   (5)

where P := ∂ψ/∂F is the Piola stress tensor.
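To make the index notation of the push-forward (1) concrete, the following minimal Python sketch (illustrative only, not the paper's deal.II implementation) applies χ to a second-order tensor, χ(A) = F · A · F^T, and checks two elementary properties.

```python
# Minimal sketch (not the paper's code): push-forward chi of (1) for a
# second-order tensor A, chi(A)_ij = F_iA A_AB F_jB, i.e. chi(A) = F.A.F^T.
def matmul(A, B):
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def transpose(A):
    return [list(row) for row in zip(*A)]

def chi(F, A):
    """Push-forward of a second-order referential tensor: F . A . F^T."""
    return matmul(matmul(F, A), transpose(F))

# For F = I the push-forward is the identity map.
I2 = [[1.0, 0.0], [0.0, 1.0]]
A = [[2.0, 1.0], [1.0, 3.0]]
assert chi(I2, A) == A

# chi preserves symmetry, as needed later for tau = chi(S) with symmetric S.
F = [[1.0, 0.5], [0.0, 1.0]]  # simple shear
B = chi(F, A)
assert abs(B[0][1] - B[1][0]) < 1e-14
```

The fourth-order variant (2) follows the same pattern, with one deformation gradient contracted into each free index.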
The double contraction in the first term can be re-written in terms of the symmetric Kirchhoff stress tensor τ := P · F^T as

    P : Grad δu = [τ · F^{−T}] : Grad δu = τ : [Grad δu · F^{−1}] = τ : grad δu,   (6)

and therefore

    F(u, δu) = ∫_B τ : grad^s δu dV − ∫_{∂B_N} T · δu dS = 0,   (7)

where grad^s δu := ½[grad δu + (grad δu)^T] is involved due to the symmetry of τ. Note that τ is the push-forward transformation of the Piola-Kirchhoff stress S := ∂ψ/∂E ≡ 2 ∂ψ/∂C, that is τ = χ(S) = F · S · F^T.

In order to solve (7) using Newton's method, a first-order approximation around a given solution field ū is required such that

    F(ū + Δu, δu) ≈ F(ū, δu) + D_{Δu} F(ū, δu),   (8)

where D_{Δu}(•) denotes the directional derivative in the direction Δu. For conservative traction boundary conditions, the directional derivative is given by

    D_{Δu} F(ū, δu) = ∫_B D_{Δu}(F · S · F^T) : grad^s δu dV + ∫_B τ̄ : [Grad δu · D_{Δu} F^{−1}] dV.   (9)

Following [12], this can be simplified to

    D_{Δu} F(ū, δu) = ∫_B grad^s Δu : J𝔠̄ : grad^s δu dV + ∫_B grad δu : [grad Δu · τ̄] dV.   (10)

Here barred quantities (•̄) are evaluated using the displacement field ū, and the fourth-order material part of the spatial tangent stiffness tensor is the push-forward of the material part of the referential tangent stiffness tensor, J𝔠 = χ(4 d²ψ(C)/(dC ⊗ dC)). The first term in (10) is related to the linearization of the Piola-Kirchhoff stress S and is therefore called the material part of the directional derivative. The second term in (10) is called the geometric part of the directional derivative since it originates from the linearization of F, F^T and F^{−1}.
To that end, D_{Δu}E = ½[D_{Δu}F^T · F + F^T · D_{Δu}F], D_{Δu}F = Grad Δu and D_{Δu}F^{−1} = −F^{−1} · D_{Δu}F · F^{−1} are employed, together with D_{Δu}S = 2 ∂S/∂C : D_{Δu}E.

2.3. Constitutive modelling

For the current study, we use the compressible Neo-Hookean model

    ψ(C) = μ/2 [tr C − tr I − 2 ln(J)] + λ/2 ln²(J),   (11)

where μ and λ denote the shear modulus and Lamé parameter, respectively. Derived using statistical thermodynamics applied to cross-linked polymers, the Neo-Hookean model [13, 14] is commonly used to model rubber-like materials in the finite-strain regime. It can be shown that the first and second derivatives of the strain energy function are given by

    dψ(C)/dC = μ/2 I − ½[μ − λ ln(J)] C^{−1},   (12)
    d²ψ(C)/(dC ⊗ dC) = ½[μ − λ ln(J)] [−dC^{−1}/dC] + λ/4 C^{−1} ⊗ C^{−1}.   (13)

The Kirchhoff stress and the associated fourth-order material part of the spatial tangent tensor are

    τ ≡ J σ = χ(2 dψ(C)/dC) = μ b − [μ − λ ln(J)] I   (14)

and

    J𝔠 = χ(4 d²ψ(C)/(dC ⊗ dC)) = 2[μ − λ ln(J)] 𝕊 + λ I ⊗ I,   (15)

where 𝕊 is the fourth-order symmetric identity tensor with 𝕊_{ijkl} = ½[δ_{ik} δ_{jl} + δ_{il} δ_{jk}]. The action that J𝔠 performs when contracted with an arbitrary second-order symmetric tensor is therefore

    J𝔠 : (•) = 2[μ − λ ln(J)] (•) + λ tr(•) I.   (16)
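As a concrete check of (14) and (16), the following sketch (pure Python on nested lists, not from the paper's implementation; mu and lmbda are arbitrary test moduli) evaluates the Kirchhoff stress and the cheap closed-form tangent action for 3×3 tensors.

```python
import math

# Hedged sketch of (14) and (16) for the compressible Neo-Hookean model;
# 3x3 tensors are nested lists, mu and lmbda are arbitrary test moduli.
def det3(F):
    return (F[0][0] * (F[1][1] * F[2][2] - F[1][2] * F[2][1])
          - F[0][1] * (F[1][0] * F[2][2] - F[1][2] * F[2][0])
          + F[0][2] * (F[1][0] * F[2][1] - F[1][1] * F[2][0]))

def tau_kirchhoff(F, mu, lmbda):
    """tau = mu b - [mu - lmbda ln J] I, with b = F.F^T and J = det F, cf. (14)."""
    J = det3(F)
    b = [[sum(F[i][k] * F[j][k] for k in range(3)) for j in range(3)]
         for i in range(3)]
    c = mu - lmbda * math.log(J)
    return [[mu * b[i][j] - (c if i == j else 0.0) for j in range(3)]
            for i in range(3)]

def tangent_action(F, mu, lmbda, x):
    """J c : x = 2 [mu - lmbda ln J] x + lmbda tr(x) I for symmetric x, cf. (16)."""
    c = mu - lmbda * math.log(det3(F))
    tr = x[0][0] + x[1][1] + x[2][2]
    return [[2.0 * c * x[i][j] + (lmbda * tr if i == j else 0.0)
             for j in range(3)] for i in range(3)]

# Sanity check: in the undeformed state (F = I, so J = 1 and b = I) the
# Kirchhoff stress vanishes identically.
I3 = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
assert all(abs(t) < 1e-14 for row in tau_kirchhoff(I3, 1.0, 1.0) for t in row)
```

Note that the tangent action needs only two scalars, 2[μ − λ ln J] and λ, per quadrature point; this observation is what the caching strategies of Section 5 exploit.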
3. Finite Element discretization
We now introduce a FE triangulation B^h of B and the associated FE space of continuous piecewise elements with polynomial space of fixed degree. The displacement fields are given in a vector space spanned by standard vector-valued FE basis functions N_i(X) (e.g. polynomials with local support on a patch of elements):

    u^h =: Σ_{i∈I} u_i N_i(X),   δu^h =: Σ_{i∈I} δu_i N_i(X),   (17)

where the superscript h denotes that this representation is related to the FE mesh with size function h(X) and I is the set of unknown degrees of freedom (DoF). The Newton-Raphson solution approach is adopted for the nonlinear problem considered here. Given the current trial solution field ū^h, the correction field Δu^h is obtained as the solution to the following system of equations (see (7) and (10)):

    Σ_{j∈I} A_{ij} Δu_j = −F_i,   (18)

    A_{ij} ≡ a(N_i, N_j) = ∫_B [ grad^s N_i : J𝔠̄ : grad^s N_j + grad N_i : (grad N_j · τ̄) ] dV =: ∫_B ã(N_i, N_j) dV,   (19)

    F_i = ∫_B τ̄ : grad^s N_i dV − ∫_{∂B_N} T · N_i dS.   (20)

Here, A is the discrete tangent operator, F is the discrete gradient of the potential energy, and ã(N_i, N_j) is a representation of the integrand in the bilinear form a(N_i, N_j) for the trial solution field ū^h. Note that grad^s N_i and grad N_i are gradients of shape functions in the spatial configuration, whereas the integration is done in the referential configuration.
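The practical significance of (18) for matrix-free methods is that an iterative solver only ever needs the action A · v, never the individual entries A_{ij}. The toy Newton loop below makes this explicit (a hedged sketch with an invented two-DoF residual, unrelated to the elasticity problem): the inner conjugate gradient solver is handed a matrix-vector callback only.

```python
# Sketch (not the paper's code): a Newton loop in the shape of (18), where
# the linearized system A du = -F is solved by a hand-rolled CG that only
# calls a matrix-vector product. The two-DoF residual below is invented
# purely for illustration; its tangent is symmetric positive definite.
def cg(matvec, b, tol=1e-12, maxit=100):
    """Conjugate gradients for an SPD operator given only v -> A v."""
    x = [0.0] * len(b)
    r = list(b)
    p = list(r)
    rs = sum(ri * ri for ri in r)
    for _ in range(maxit):
        Ap = matvec(p)
        alpha = rs / sum(pi * Api for pi, Api in zip(p, Ap))
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * Api for ri, Api in zip(r, Ap)]
        rs_new = sum(ri * ri for ri in r)
        if rs_new < tol * tol:
            break
        p = [ri + (rs_new / rs) * pi for ri, pi in zip(r, p)]
        rs = rs_new
    return x

def residual(u):
    """Toy nonlinear residual F(u)."""
    return [u[0] ** 3 + u[1] - 2.0, u[0] + 2.0 * u[1] - 2.0]

def tangent_matvec(u, v):
    """Action of A = dF/du on v, computed on the fly (no matrix is stored)."""
    return [3.0 * u[0] ** 2 * v[0] + v[1], v[0] + 2.0 * v[1]]

u = [1.0, 1.0]
for _ in range(20):
    F = residual(u)
    if max(abs(fi) for fi in F) < 1e-12:
        break
    du = cg(lambda v: tangent_matvec(u, v), [-fi for fi in F])
    u = [ui + dui for ui, dui in zip(u, du)]
assert max(abs(fi) for fi in residual(u)) < 1e-10
```

The same structure carries over to (18): only `tangent_matvec` changes, becoming the quadrature-based operator application described in Section 5.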
4. Description of the utilized numerical framework
The discretized system of linear equations formulated in Section 3, in conjunction with the constitutive model described in Section 2.3, was implemented using deal.II version 9.0 [11], an open-source finite element library. This library offers a number of paradigms by which to parallelize a finite-element code, and as a member of the xSDK (Extreme-scale Scientific Software Development Kit) ecosystem [15] aims to support exascale computing. In this work we employ deal.II's interface to the p4est [16] library to perform a standard domain decomposition, whereafter each MPI process "owns" only a subset of elements and the mesh is distributed across all MPI processes. Figure 1 illustrates a distributed MPI decomposition on the fine mesh level for the 2D problem that will be introduced later in Section 7.

In practice, this implies that parallelization of the assembly operation (for the matrix-based approach) and the linear solver (as well as various pre- and post-processing steps), both of which are the focal points of investigation in this manuscript, has been achieved using MPI. For the linear solver, deal.II's implementation of the preconditioned conjugate gradient (CG) solver for symmetric positive-definite systems is consistently used, leaving only the choice of the method of assembly and of the linear system preconditioning.

Figure 1: 2D mesh after two global refinements distributed onto three MPI processes. The color indicates the process owning the element.

The discrete tangent operator described by (19) can be implemented in one of two ways, namely through a classical matrix-based approach or, alternatively, a matrix-free approach. For the matrix-based approach adopted in this study, the Trilinos [17] library was leveraged; the system tangent matrix is stored in a distributed Epetra_FECrsMatrix sparse data structure from the Epetra package [18], and the linear solver was preconditioned with an algebraic multigrid (AMG) preconditioner from the ML package [19].
This is, to date, the most performant matrix-based approach possible in deal.II using the Trilinos framework. The second implementation, which is discussed in detail in the following sections, utilizes a matrix-free approach in conjunction with a geometric multigrid preconditioner. As will be detailed later, the pre-existing matrix-free framework that was used offers further opportunity for low-level parallelism that we exploit.

5. Matrix-free operator evaluation

Classically, in order to solve (18), the matrix A and the residual force vector F corresponding to the discretization are assembled (that is, the non-zero matrix elements A_{ij} are individually computed and stored) and a direct or iterative solver is used to solve the linear system. In this case, the matrix-vector product Ax results from the product of the individual matrix elements with the elements of the vector, and is usually implemented with the aid of specialized linear algebra packages. On modern computer architectures, however, loading sparse matrix data into the CPU registers is significantly slower than performing the arithmetic operations when solving. For this reason, recent implementations often focus on so-called matrix-free approaches, where the matrix is not assembled but rather the result of its operation on a vector is directly calculated. The idea is to perform the operations within the solver on-the-fly rather than loading matrix elements from memory. For iterative solvers this is possible since it is sufficient to compute matrix-vector products alone. The matrix-vector product Ax (i.e. the action of the operator A on a vector x) can be expressed as

    (Ax)_i = Σ_j a(N_i, N_j) x_j ≈ Σ_K Σ_q Σ_j ã(N_i, N_j)(ξ_q) x_j w_q J^K_q,   (21)

where ã(N_i, N_j)(ξ_q) is a representation of the evaluation of the bilinear form at one quadrature point ξ_q, J^K_q is the Jacobian of the mapping from the isoparametric element to the element K, and w_q is the quadrature weight. For the sake of demonstration, let us assume that A represents a mass operator with scalar-valued shape functions {N_i(x)}. In this case, it holds that

    ã(N_i, N_j)(ξ_q) = N_i(ξ_q) N_j(ξ_q).

Further assuming that the number of quadrature points n in one dimension matches the degree of the ansatz space plus one, the evaluation of all of the shape functions (n^d of them) at all quadrature points (n^d of them) has complexity O(n^{2d}) in d space dimensions. For the total matrix-vector product per element we thus get O(n^{2d}), and this is the same as for a regular matrix-vector product where the entries are already precomputed. Assuming tensor-product ansatz spaces and a tensor-product quadrature rule, this operation can be evaluated more efficiently using a technique called "sum factorization".

Figure 2: Illustration of tensorial indexing for nodes and quadrature points: (a) nodes of a cubic Lagrange element; (b) Gaussian quadrature points. In this sketch, each node/quadrature point can be indexed with a two-dimensional multi-index.

Let us illustrate this in 2D. Tensor-product quadrature points can be expressed as a combination of one-dimensional quadrature points

    ξ_q = (ξ̃_{q₁}, ξ̃_{q₂}),   (22)

where with each quadrature point q we can associate a multi-index (q₁, q₂). The weights associated with the quadrature formula can be expressed as a product of one-dimensional weights,

    w_q = w̃_{q₁} w̃_{q₂}.   (23)
Shape functions can also be expressed as a product of one-dimensional basis functions,

    N_i(ξ_q) = Ñ_{i₁}(ξ̃_{q₁}) Ñ_{i₂}(ξ̃_{q₂}),   (24)

where with each DoF index i we can associate a multi-index (i₁, i₂); see Figure 2 for an illustration. Therefore, the result of the application of the scalar-valued mass operator, represented as a vector y_i, on a single element K can be written as

    y_i ≡ Y_{i₁i₂} = Σ_{q₁} Σ_{q₂} Σ_{j₁} Σ_{j₂} Ñ_{i₁}(ξ̃_{q₁}) Ñ_{i₂}(ξ̃_{q₂}) Ñ_{j₁}(ξ̃_{q₁}) Ñ_{j₂}(ξ̃_{q₂}) X_{j₁j₂} w̃_{q₁} w̃_{q₂} J^K_{q₁q₂}
           = Σ_{q₁} Ñ_{i₁}(ξ̃_{q₁}) w̃_{q₁} [ Σ_{q₂} Ñ_{i₂}(ξ̃_{q₂}) w̃_{q₂} J^K_{q₁q₂} ( Σ_{j₂} Ñ_{j₂}(ξ̃_{q₂}) ( Σ_{j₁} Ñ_{j₁}(ξ̃_{q₁}) X_{j₁j₂} ) ) ]   ∀ i₁, i₂,

where x_j ≡ X_{j₁j₂}; the former accesses the element source vector using a linear index j, whereas the latter uses the associated multi-index {j₁, j₂}. These four sums all have the same structure: we either keep the shape function fixed and iterate over the quadrature points in one spatial direction (in the first two sums) or keep the quadrature point fixed and iterate over all the shape functions in one spatial direction (in the last two sums). Since each of these 2 × d loops has a complexity of n, and we perform them simultaneously for n^d values at quadrature points or degrees of freedom, this results in an algorithm with an arithmetic complexity of O(n^{d+1}). Hence, we can expect that such a matrix-free approach is faster than a regular matrix-based approach, especially in the 3D case and where the discretization and evaluation employ high polynomial degree basis functions with the corresponding quadrature rule.
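The rearrangement above can be sketched in a few lines: in matrix form, one application of the 2D mass operator is Y = Ñ (W ∘ (Ñᵀ X Ñ)) Ñᵀ, i.e. four 1D contractions plus a pointwise scaling. The following Python comparison against the naive quadruple sum is illustrative only (random tabulated values; the names are not the deal.II API).

```python
import random

# Sketch of 2D sum factorization for the mass operator above. Nhat[i][q]
# tabulates 1D shape function i at 1D quadrature point q; all names are
# illustrative, not the deal.II API.
n = 4  # 1D basis size = number of 1D quadrature points

def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def transpose(A):
    return [list(row) for row in zip(*A)]

def mass_apply_sumfac(Nhat, w, Jdet, X):
    # interpolate DoF values to quadrature points: two 1D contractions
    U = matmul(matmul(transpose(Nhat), X), Nhat)
    # pointwise scaling by quadrature weights and the mapping Jacobian
    V = [[U[q1][q2] * w[q1] * w[q2] * Jdet[q1][q2] for q2 in range(n)]
         for q1 in range(n)]
    # apply the test functions: again two 1D contractions
    return matmul(matmul(Nhat, V), transpose(Nhat))

def mass_apply_naive(Nhat, w, Jdet, X):
    # direct evaluation of the quadruple sum, O(n^{2d}) per element
    Y = [[0.0] * n for _ in range(n)]
    for i1 in range(n):
        for i2 in range(n):
            for q1 in range(n):
                for q2 in range(n):
                    interp = sum(Nhat[j1][q1] * Nhat[j2][q2] * X[j1][j2]
                                 for j1 in range(n) for j2 in range(n))
                    Y[i1][i2] += (Nhat[i1][q1] * Nhat[i2][q2] * interp
                                  * w[q1] * w[q2] * Jdet[q1][q2])
    return Y

rng = random.Random(0)
Nhat = [[rng.random() for _ in range(n)] for _ in range(n)]
w = [rng.random() for _ in range(n)]
Jdet = [[1.0 + rng.random() for _ in range(n)] for _ in range(n)]
X = [[rng.random() for _ in range(n)] for _ in range(n)]
Ys = mass_apply_sumfac(Nhat, w, Jdet, X)
Yn = mass_apply_naive(Nhat, w, Jdet, X)
assert all(abs(Ys[i][j] - Yn[i][j]) < 1e-12 for i in range(n) for j in range(n))
```

Both routines return the same element vector, but the sum-factorized variant costs O(d n^{d+1}) operations per element instead of O(n^{2d}).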
More information on matrix-free techniques can be found in [1, 20].

For the performance of the matrix-free operator, it is crucial to find a good intermediate point between precomputing everything (caching) and evaluating all the quantities on-the-fly. In order to obtain gradients with respect to the spatial configuration, which appear in the tangent operator (19) of the finite-strain elasticity problem derived with a Lagrangian formulation, we used the MappingQEulerian class from the deal.II [11] library. In addition to the standard mapping from the reference (isoparametric) element to the element in real space, this class computes the mapping to the spatial configuration given a displacement field. Due to the specifics of matrix-free operator evaluation implemented in deal.II, the integration is eventually performed over the spatial configuration. Therefore, we have to additionally divide by the Jacobian J of the deformation map at a given quadrature point to express the integral in the referential configuration.

Below, we propose three algorithms that implement the tangent operator of finite-strain elasticity (see (19)) using the matrix-free approach, each of which has a different level of abstraction, algorithmic complexity, and memory requirements. They are introduced in order of lowest memory footprint to largest.

Algorithm 1 caches a single scalar which involves the logarithm of the Jacobian, a computationally demanding operation compared to additions and multiplications. As evident from Section 2, we need to re-evaluate the Kirchhoff stress at each quadrature point in order to apply the tangent operator. This in turn requires evaluating the deformation gradient F and therefore gradients with respect to the referential configuration.

In order to avoid recalculation of the Kirchhoff stress, Algorithm 2 caches its value for each element and quadrature point.
In this case, we also avoid the need to evaluate gradients of the displacement field with respect to the referential configuration. This algorithm also utilizes the chosen constitutive relationship in that the operation of J𝔠 remains directly expressed using (16). Therefore, the action of the material part of the fourth-order spatial tangent stiffness tensor J𝔠 on the second-order symmetric tensor grad^s x is cheap to evaluate. To that end, we cache two scalars that depend on the Jacobian J.

Finally, Algorithm 3 represents the most general case, which does not assume any form of the material part of the fourth-order spatial tangent stiffness tensor. In this case, we have to cache the fourth-order symmetric tensor 𝔠 and the Kirchhoff stress τ for each element and quadrature point, and perform the double contraction with the second-order symmetric tensor grad^s x on-the-fly.

Algorithm 1: Matrix-free tangent operator: cache scalar quantities only.
Given: source FE vector x, current FE solution u^h, cached c := μ − λ log(J) for each cell and quadrature point.
Return: action of the FE tangent operator (19) on x.
foreach element K ∈ Ω^h do
    extract local vector values u^h_K, x_K on this element;
    evaluate the following tensors at each quadrature point using sum factorization:
        Grad u^h_K;                          // 2nd order
        g^s := grad^s x_K;                   // 2nd order symmetric
        g := grad x_K;                       // 2nd order
    foreach quadrature point q on K do
        evaluate F = I + Grad u^h;           // 2nd order
        evaluate J = det(F);                 // scalar
        evaluate b = F · F^T;                // 2nd order symmetric
        evaluate τ = μ b − c I;              // 2nd order symmetric
        evaluate G_g := g · τ / J;           // 2nd order
        queue G_g for contraction grad N_i : G_g;
        evaluate C_{g^s} := [2 c g^s + λ tr(g^s) I] / J;   // 2nd order symmetric
        queue C_{g^s} for contraction grad^s N_i : C_{g^s};
    end
    evaluate queued contractions using sum factorization;
    distribute results to the destination vector;
end

Algorithm 2: Matrix-free tangent operator: cache the second-order Kirchhoff stress τ and thereby avoid the need to evaluate referential quantities like F when executed.
Given: source FE vector x, cached c₁ := 2[μ − λ log(J)]/J, c₂ := λ/J and τ/J for each cell and quadrature point, evaluated based on ū^h.
Return: action of the FE tangent operator (19) on x.
foreach element K ∈ Ω^h do
    extract local vector values x_K on this element;
    evaluate the following tensors at each quadrature point using sum factorization:
        g^s := grad^s x_K;                   // 2nd order symmetric
        g := grad x_K;                       // 2nd order
    foreach quadrature point q on K do
        evaluate G_g := g · [τ / J];         // 2nd order
        queue G_g for contraction grad N_i : G_g;
        evaluate C_{g^s} := c₁ g^s + c₂ tr(g^s) I;   // 2nd order symmetric
        queue C_{g^s} for contraction grad^s N_i : C_{g^s};
    end
    evaluate queued contractions using sum factorization;
    distribute results to the destination vector;
end

Algorithm 3: Matrix-free tangent operator: cache the material part of the fourth-order spatial tangent stiffness tensor 𝔠 and the Kirchhoff stress τ.
Given: source FE vector x, cached τ/J and 𝔠 for each cell and quadrature point, evaluated based on ū^h.
Return: action of the FE tangent operator (19) on x.
foreach element K ∈ Ω^h do
    extract local vector values x_K on this element;
    evaluate the following tensors at each quadrature point using sum factorization:
        g^s := grad^s x_K;                   // 2nd order symmetric
        g := grad x_K;                       // 2nd order
    foreach quadrature point q on K do
        evaluate G_g := g · [τ / J];         // 2nd order
        queue G_g for contraction grad N_i : G_g;
        evaluate C_{g^s} := 𝔠 : g^s;         // 2nd order symmetric
        queue C_{g^s} for contraction grad^s N_i : C_{g^s};
    end
    evaluate queued contractions using sum factorization;
    distribute results to the destination vector;
end

Note that the SIMD vectorization in deal.II [11] is applied at the finite element level, i.e. the matrix-free operator is applied simultaneously on several elements (called "blocks"). The main reason is that, apart from the point-wise computation of the stresses and elastic tangents, the operations are typically the same on all elements. For a given quadrature point, we simultaneously perform tensor evaluations and contractions on all elements. For example, the deformation gradient F = I + Grad u^h (a second-order tensor) is evaluated at the specific quadrature point of all elements within the "block". This is achieved using templated number types within the Tensor class of the deal.II library. In particular, deal.II provides the VectorizedArray templated generic class that defines a unified interface to a vectorized data type and overloads basic arithmetic operations based on compiler intrinsics. The core of the sum factorization algorithms is provided by the FEEvaluation class, which already supports all the necessary contractions (such as grad^s N_i : C_{g^s}) to implement the three algorithms proposed above. Thanks to the object-oriented design of the deal.II library, these algorithms build directly on the matrix-free infrastructure of deal.II (including hanging-node constraints, storage of the inverse Jacobian of the transformation from the unit to the real cell, MPI and distributed-memory parallelization, as well as SIMD vectorization); for implementation details and performance tests, see [1].

Footnote: In general, not all quadrature points in each element group are endowed with identical, non-dissipative constitutive laws that follow the same code path during evaluation of the kinetic quantities and tangents. This motivates the caching strategy employed in Algorithm 3, as the pre-caching of the chosen data ensures a context in which SIMD parallelization of subsequent operations is guaranteed. Algorithms 1 and 2 sacrifice some of this generality for decreased memory requirements.

Footnote: For the heterogeneous materials considered here, the elements are first grouped according to the material type (matrix or inclusion) and then the SIMD vectorization is applied to each group of elements. This functionality is provided by the deal.II library in the MatrixFree class.
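As an illustration of the per-quadrature-point work in Algorithm 2, the sketch below (hedged: plain Python on nested lists, not the deal.II kernels) applies the cached scalars c₁, c₂ and the cached tensor τ/J to the gradient of a source vector.

```python
# Hedged sketch (not the deal.II implementation) of the quadrature-point
# kernel of Algorithm 2: given cached scalars c1 = 2[mu - lambda ln J]/J
# and c2 = lambda/J and the cached tensor tau/J, apply the geometric and
# material parts of the tangent to the gradient g = grad x of the source.
def qpoint_kernel_alg2(c1, c2, tau_over_J, g):
    """Return (G_g, C_gs) queued for contraction at one quadrature point."""
    n = len(g)
    # symmetric gradient g^s
    gs = [[0.5 * (g[i][j] + g[j][i]) for j in range(n)] for i in range(n)]
    # geometric part: G_g = g . [tau / J]
    Gg = [[sum(g[i][k] * tau_over_J[k][j] for k in range(n)) for j in range(n)]
          for i in range(n)]
    # material part: C_gs = c1 g^s + c2 tr(g^s) I, cf. (16) scaled by 1/J
    tr = sum(gs[i][i] for i in range(n))
    Cgs = [[c1 * gs[i][j] + (c2 * tr if i == j else 0.0) for j in range(n)]
           for i in range(n)]
    return Gg, Cgs

# With tau = 0, c2 = 0 and c1 = 1 the kernel reduces to symmetrization.
g = [[0.0, 1.0], [0.0, 0.0]]
zero = [[0.0, 0.0], [0.0, 0.0]]
Gg, Cgs = qpoint_kernel_alg2(1.0, 0.0, zero, g)
assert Gg == zero
assert Cgs == [[0.0, 0.5], [0.5, 0.0]]
```

In the actual deal.II implementation these arithmetic operations act on VectorizedArray lanes, so the same kernel runs on a whole block of elements at once.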
6. Geometric multigrid preconditioning

Efficient and scalable matrix-vector products are not enough to obtain a method suitable for large computations. In particular, linear scaling with respect to the number N of unknown DoFs is desirable. The efficiency of standard iterative solvers and preconditioners deteriorates for large systems of linear equations due to increasing iteration counts under mesh refinement. In contrast to other preconditioners, geometric multigrid (GMG) preconditioners [2, 21–23] can be proven to result in iteration counts that are independent of the number of mesh refinements by smoothing the residual on a hierarchy of meshes. For further information about geometric multigrid methods we refer the reader to [22, 24, 25]. Implementational details of the MPI-parallel multigrid method in deal.II are discussed in [8].

For this work, we adopt matrix-free level operators within the GMG. In general, given a heterogeneous non-linear material, the definition of transfer operators (for restriction and prolongation) and level operators is not trivial. One approach is to start from an arbitrary heterogeneous material at the fine scale and employ homogenization theory [26–29] to design suitable transfer operators [30]. In this approach, the fine-scale displacement is decomposed into long-wave and short-wave contributions by an additive split, where the former is associated with the homogeneous response of a patch of elements and the latter represents fluctuations due to the heterogeneous structure of the patch. However, for a patch of elements consisting of the same material, the transfer operators reduce to the standard geometric transfer. Based on this observation we adopt the following approach: It is assumed that the mesh at the coarsest level can accurately describe the heterogeneous structure of the material, and the finest mesh carries the current trial solution ū^h. We then restrict this displacement field to all multigrid levels {l} and evaluate level tangent operators using the matrix-free algorithms from Section 5. Essentially, the level operators correspond to the linearization around the smoothed representation of the displacement field.

To summarize, in this section we propose a method by which to construct level tangent operators for finite-strain elasticity to be employed within the GMG preconditioner. The other parts of the GMG preconditioner, such as level smoothers, matrix-free interlevel transfer operators and the coarse-level iterative solver, are provided by the deal.II library. The efficiency of the preconditioner with the considered level operators will be assessed by numerical examples in the next section. We will, however, refrain from studying the machine-performance aspects of the multigrid preconditioner as a whole, as the focus of this paper is the matrix-free implementation of tangent operators of finite-strain elasticity both at the fine mesh level and at the coarser multigrid levels. Finally, we note that, at the time of writing, a p-version of the GMG was not available within the deal.II library and was therefore not considered here.

For examples of other approaches that combine homogenization and multigrid methods, see [31, 32] for applications using small-strain elastic materials and [33] for applications in fracture.

Footnote: Transfer operators are finite-dimensional linear operators from a vector space associated with the fine-scale mesh to one associated with the coarse-scale mesh, and in the opposite direction.

Footnote: We note that this approach is limited to the case where the fourth-order tangent stiffness tensor is the same for each quadrature point within an element.
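The interplay of level operators, smoothers and transfer operators can be illustrated on a deliberately simple analogue: a two-grid cycle for a 1D Poisson problem whose level operators are matrix-free re-discretizations on each mesh. This is a standard textbook sketch under stated assumptions (damped Jacobi smoothing, full-weighting restriction, linear interpolation), not the paper's elasticity setup.

```python
# Illustrative two-grid cycle (not the paper's elasticity setting): the
# level operator is a matrix-free re-discretization of a 1D Poisson
# problem on each mesh, the smoother is damped Jacobi, and the transfer
# operators are full weighting / linear interpolation.
def apply_A(u, h):
    """Matrix-free action of the 1D Laplacian with homogeneous Dirichlet BCs."""
    n = len(u)
    at = lambda i: u[i] if 0 <= i < n else 0.0
    return [(2.0 * at(i) - at(i - 1) - at(i + 1)) / (h * h) for i in range(n)]

def jacobi(u, f, h, iters=3, omega=2.0 / 3.0):
    for _ in range(iters):
        r = [fi - Ai for fi, Ai in zip(f, apply_A(u, h))]
        u = [ui + omega * (h * h / 2.0) * ri for ui, ri in zip(u, r)]
    return u

def restrict(r):
    # full weighting onto the coarse grid (coarse point i <-> fine point 2i+1)
    return [0.25 * r[2 * i] + 0.5 * r[2 * i + 1] + 0.25 * r[2 * i + 2]
            for i in range((len(r) - 1) // 2)]

def prolong(e, nf):
    out = [0.0] * nf
    for i, ei in enumerate(e):
        out[2 * i + 1] += ei
        out[2 * i] += 0.5 * ei
        out[2 * i + 2] += 0.5 * ei
    return out

def coarse_solve(f, h):
    # exact tridiagonal solve (Thomas algorithm) on the coarse level
    n = len(f)
    a, b = -1.0 / (h * h), 2.0 / (h * h)
    cp, dp = [0.0] * n, [0.0] * n
    cp[0], dp[0] = a / b, f[0] / b
    for i in range(1, n):
        m = b - a * cp[i - 1]
        cp[i] = a / m
        dp[i] = (f[i] - a * dp[i - 1]) / m
    x = [0.0] * n
    x[-1] = dp[-1]
    for i in range(n - 2, -1, -1):
        x[i] = dp[i] - cp[i] * x[i + 1]
    return x

def two_grid(u, f, h):
    u = jacobi(u, f, h)                                      # pre-smoothing
    r = [fi - Ai for fi, Ai in zip(f, apply_A(u, h))]
    e = prolong(coarse_solve(restrict(r), 2.0 * h), len(u))  # coarse correction
    u = [ui + ei for ui, ei in zip(u, e)]
    return jacobi(u, f, h)                                   # post-smoothing

# Rapid error reduction on a manufactured problem.
nf, h = 15, 1.0 / 16.0
u_exact = [((i * 37) % 11) / 11.0 for i in range(nf)]  # arbitrary exact solution
f = apply_A(u_exact, h)
u = [0.0] * nf
for _ in range(20):
    u = two_grid(u, f, h)
err = max(abs(ui - ei) for ui, ei in zip(u, u_exact))
assert err < 1e-8
```

The proposal of this section replaces the re-discretized Laplacian with the matrix-free elasticity tangent operators of Section 5, linearized on each level around the restricted displacement field.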
7. Numerical examples
In this section, we apply the proposed matrix-free operator algorithms for finite-strain hyperelastic materials, as well as the geometric multigrid preconditioner, to a benchmark problem from [30]. The coarse meshes for the 2D and 3D problems are illustrated in Figure 3. The 2D material consists of a square (depicted in blue), a hole and two spherical inclusions (depicted in red). The size of the square domain is 10− mm. The matrix material is taken to have Poisson's ratio 0. and µ = . × N/mm; the inclusion is taken to be 100 times stiffer. The 3D material is obtained by extrusion of the 2D geometry into the third dimension. Note that although a relatively coarse mesh is used at the coarsest multigrid level, the geometry of the inclusions is captured accurately by manifold descriptions of the boundaries and interfaces. Both domains are fully fixed along the bottom surface, and a distributed load is applied at the top in the (1, 0) or (1, ·, 0) direction for the 2D and the 3D problems, respectively. The force density 12. × N/mm or 12.√ × N/mm is applied in 5 steps for the 2D and the 3D problems, respectively. With respect to the nonlinear solver, the tolerance on the ℓ norm of the Newton displacement update is taken to be 10−, whereas the relative and absolute tolerances for the residual forces are 10−. The relative convergence criterion for the linear solver is 10−. Figure 4 shows the deformed meshes at the final loading step with quadratic elements and two global mesh refinements. The same distributed approach to mesh decomposition (as most evident on the finest level) is adopted on the multigrid levels. Figure 5 illustrates the hierarchy of multigrid meshes for the 2D problem after two global refinements for the mesh introduced in Figure 1.

Note that this differs from the classical approach, where the (tangent) operator A_{l+1} on level l+1 is used to construct the operator A_l on level l via the triple product A_l = I^l_{l+1} A_{l+1} I^{l+1}_l, where I^{l+1}_l and I^l_{l+1} are global prolongation (coarse-to-fine) and restriction (fine-to-coarse) operators, respectively.

(a) 2D coarse mesh (b) 3D coarse mesh
Figure 3: Discretization of the heterogeneous material at the coarsest mesh level and the prescribed boundary conditions.

Aligned with the approach taken in similar previous investigations (compare e.g. [9]), we limit ourselves in this study to comparisons made at the node level. Examinations performed in this manner should already indicate the differences between the matrix-free and matrix-based approaches, and give some preliminary understanding of the scalability potential of the numerical implementation without
Table 1: Parameters for the 2D benchmark: p is the polynomial degree, q is the number of quadrature points in 1D, N_gref is the number of global mesh refinements, N_el is the number of elements and N_DoF is the number of DoFs.
Table 2: Parameters for the 3D benchmark: p is the polynomial degree, q is the number of quadrature points in 1D, N_gref is the number of global mesh refinements, N_el is the number of elements and N_DoF is the number of DoFs.

(a) 2D deformed mesh (b) 3D deformed mesh
Figure 4: Deformed meshes at the final loading step with quadratic elements and two global mesh refinements.

Figure 5: 2D multigrid mesh after two global refinements distributed into four MPI processes. The color indicates the process owning the element.

confounding factors related to data exchange through interconnects via MPI. The benchmark calculations are performed for 2D and 3D problems with various combinations of polynomial degrees and numbers of global refinements, as stated in Tables 1 and 2. We chose a comparable problem size for each test case, set by the upper limit that the memory requirements of the matrix-based approach with a high polynomial order finite element impose. Even so, the problem size is large enough that the sparse matrix (in excess of 1 Gb when using the lowest polynomial order finite element discretization) will not fit into the L3 cache (on the order of tens of Mb in size), and therefore we can observe that the matrix-based approach is memory bound.

We conduct two performance studies using the described discretization: the first considers only the matrix-vector multiplication, and the second the preconditioned iterative linear solver. The results of the matrix-free approach with a geometric multigrid preconditioner are compared to the matrix-based approach in conjunction with the algebraic multigrid (AMG) preconditioner provided by the package ML [19] of the Trilinos [17] library, version 12.12.1. The aggregation threshold for the AMG preconditioner is taken as 10−. The geometric multigrid algorithm uses a V-cycle with two pre- and post-smoothing steps using Chebyshev polynomials [34] of fourth degree, provided by the deal.II library via the PreconditionChebyshev class.
The largest eigenvalue λ_max of the level operators is estimated from 30 iterations of the CG solver, and the smoother is set to target the range [0. λ_max, . λ_max]. On the coarsest level, we use the Chebyshev preconditioner as a solver [34] to reduce the residual by three orders of magnitude. Since this smoother only uses the diagonal elements of the operator, it is readily compatible with the matrix-free approach, whereas most default smoothers rely on an explicit matrix form.

Computations were performed on a single node of two clusters: a node of the "Emmy" cluster at RRZE, FAU has two Xeon 2660v2 "Ivy Bridge" chips (10 cores per chip + SMT) running at 2.2 GHz with 25 MB shared cache per chip and 64 GB of RAM. Intel's compiler version 18.03 with flags "-O3 -march=native" and Intel MPI version 18.03 are used. The "IWR" results were obtained on a single machine with eight Xeon E7-8870 "Sandy Bridge" chips (10 cores per chip + SMT) running at 2.4 GHz with 30 MB shared cache per chip. For these simulations, GCC version 8.1.0 with flags "-O3 -march=native" and OpenMPI version 3.0.0 were used. Unless noted otherwise, the simulations are performed with 20 MPI processes, i.e. one fully occupied node.

Figures 6 and 7 show the results of the matrix-vector multiplication for the considered finite-strain hyperelastic benchmark problem as performed on the two clusters. We are interested in the wall-clock time (average over 10 runs per linear solver step) per DoF as a first metric. As expected, the matrix-vector multiplication becomes very expensive for higher-order discretizations for sparse matrix-based approaches, as the matrix becomes more and more dense. The performance results for the finite-strain elasticity operators considered in this study are qualitatively similar to those reported for the Laplace operator in [1].
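A diagonal-only Chebyshev smoother of the kind described above can be sketched as follows. The eigenvalue estimate here uses plain power iteration rather than the CG-based estimate of the text, the 1D Laplacian stands in for a level tangent operator, and the safety factor 1.2 and range divisor 6 are assumed placeholders for the (partially garbled) values quoted above:

```python
import numpy as np

def chebyshev_smoother(A, b, x, steps, lam_min, lam_max):
    """Chebyshev smoother for the Jacobi-preconditioned operator D^{-1}A,
    targeting eigenvalues in [lam_min, lam_max]; only diag(A) is needed."""
    Dinv = 1.0 / np.diag(A)
    theta = 0.5 * (lam_max + lam_min)   # center of the targeted interval
    delta = 0.5 * (lam_max - lam_min)   # half width of the interval
    sigma = theta / delta
    rho = 1.0 / sigma
    d = Dinv * (b - A @ x) / theta
    x = x + d
    for _ in range(steps - 1):          # standard three-term recurrence
        rho_new = 1.0 / (2.0 * sigma - rho)
        d = rho_new * rho * d + (2.0 * rho_new / delta) * Dinv * (b - A @ x)
        x = x + d
        rho = rho_new
    return x

# 1D Laplacian stencil as a stand-in for a level tangent operator (assumption)
n = 64
A = 2.0 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)

# estimate the largest eigenvalue of D^{-1}A by 30 power iterations
rng = np.random.default_rng(0)
Dinv = 1.0 / np.diag(A)
v = rng.random(n)
for _ in range(30):
    v = Dinv * (A @ v)
    v /= np.linalg.norm(v)
lam_est = v @ (Dinv * (A @ v))

# smooth a highly oscillatory error mode with a degree-4 smoother
x_exact = np.sin(np.arange(1, n + 1) * (n - 4) * np.pi / (n + 1))
b = A @ x_exact
x = chebyshev_smoother(A, b, np.zeros(n), steps=4,
                       lam_min=1.2 * lam_est / 6.0, lam_max=1.2 * lam_est)
err0 = np.linalg.norm(x_exact)
err1 = np.linalg.norm(x - x_exact)
```

Because the iteration touches the operator only through matrix-vector products and its diagonal, it carries over directly to a matrix-free setting.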
(a) matrix-vector product (2D); (b) matrix-vector product (3D); (c) memory consumption (2D); (d) memory consumption (3D)
Figure 6: Emmy cluster, RRZE, matrix-vector multiplication.

Note that we consider only non-Cartesian meshes together with additional mappings (namely, manifold descriptions of the boundaries and interfaces [35] to capture the geometry, as well as the linearization-related mapping required for the evaluation of gradients with respect to the deformed configuration). Both factors increase the memory consumption and make the element operations more expensive. Nonetheless, the matrix-free implementation is faster than the matrix-based counterpart already for a bi-cubic Lagrangian basis in 2D and a tri-quadratic one in 3D. Figures 6(a), 6(b), 7(a) and 7(b) clearly show the influence of the different caching strategies described in Algorithms 1, 2 and 3. Two conclusions can be drawn from these results: The scalar caching implementation is the most time consuming of the three, as at each step we additionally evaluate the gradient of the displacement field in the referential configuration, which is required to evaluate the Kirchhoff stress τ. There is little
(a) matrix-vector product (2D); (b) matrix-vector product (3D); (c) memory consumption (2D); (d) memory consumption (3D)
Figure 7: IWR cluster, matrix-vector multiplication.

difference in terms of the wall-clock time per DoF between the approach where the material part of the fourth-order spatial tangent stiffness tensor is cached ("tensor4") and the one where we only cache the second-order Kirchhoff tensor and utilize the chosen hyperelastic constitutive equation to efficiently implement the action of the material part of the fourth-order spatial tangent stiffness tensor on a second-order symmetric tensor ("tensor2"). This indicates that the time savings gained from the latter approach are insignificant within the complete matrix-free operator implementation. Consequently, we can apply the matrix-free approach to any finite-strain material model by caching the material part of the fourth-order spatial tangent stiffness tensor in addition to the Kirchhoff stress, and expect this to be faster than the matrix-based implementation. It is therefore feasible to define highly performant generic operators that are independent of any applied constitutive laws, as long as the resulting tangent operator is expressed via (19).

As outlined in the introduction, it is not only the performance of the matrix-free vector multiplication, but also the memory required to store a sparse matrix, which is the driving force behind matrix-free methods. Figures 6(c), 6(d), 7(c) and 7(d) demonstrate that even with the caching of one fourth-order symmetric tensor and one second-order tensor at each quadrature point ("tensor4"), the matrix-free approach requires much less memory than its matrix-based counterpart. In those graphs, we report the overall memory needed to cache the additional quantities required for the tangent operator considered here, as well as the memory required by the deal.II MatrixFree object, which caches the inverse of the Jacobian of the transformation from the unit to the real cell [1].
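The "tensor4" versus "tensor2" trade-off at a single quadrature point can be illustrated as follows. For a constitutive model whose material tangent has the generic isotropic form c1 I⊗I + 2 c2 𝕀_sym (the two cached scalars; the numerical values below are hypothetical), the full fourth-order double contraction and the scalar shortcut give identical results:

```python
import numpy as np

rng = np.random.default_rng(0)
c1, c2 = 1.7, 0.9     # hypothetical per-quadrature-point moduli ("tensor2" data)
I = np.eye(3)

# "tensor4" strategy: cache the full fourth-order material tangent
#   C_ijkl = c1 * d_ij d_kl + c2 * (d_ik d_jl + d_il d_jk)
C4 = (c1 * np.einsum('ij,kl->ijkl', I, I)
      + c2 * (np.einsum('ik,jl->ijkl', I, I)
              + np.einsum('il,jk->ijkl', I, I)))

g = rng.random((3, 3))   # gradient of the increment field at the point
s = 0.5 * (g + g.T)      # symmetric part

# full double contraction with the cached rank-4 tensor
out_tensor4 = np.einsum('ijkl,kl->ij', C4, s)

# "tensor2"-style shortcut: exploit the constitutive structure, load two scalars
out_tensor2 = c1 * np.trace(s) * I + 2.0 * c2 * s

match = np.allclose(out_tensor4, out_tensor2)
```

The shortcut moves far fewer bytes per quadrature point; as the measurements above show, however, this saving is minor within the complete operator evaluation.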
Note that for Algorithm 1 we need to evaluate gradients with respect to both the deformed and the undeformed configuration, and therefore two inverses will be cached via two different MatrixFree objects.

Finally, we analyze the node-level performance of the two approaches using the roofline model. The memory bandwidth (MByte/s) and the number of floating point operations per second (MFlop/s) are measured using the MEM_DP group of the LIKWID [36] tool, version 4.2.1. The measurements are done on a single socket of the RRZE cluster by pinning all 10 MPI processes to this socket, whose memory bandwidth B (as measured with the load_avx benchmark of likwid-bench [36]) and peak performance P = 176 GFlop/s are used in the roofline model. Turbo mode of the CPU was disabled.

Figure 8 shows the results of the roofline performance model. As expected, the matrix-based approach is memory bound and (on average) achieves a performance of 8.21 GFlop/s (4.6% of the peak performance) in 2D. The matrix-free implementations of the finite-strain elasticity tangent operator (on average) lead to a much better performance of about 18 GFlop/s in 2D, as well as about an order of magnitude higher computational intensity, but still stay in the memory-bound regime. The achieved performance is about 10.6% of the peak arithmetic performance. This is much lower than the 70% reported for the Laplace operator on Cartesian meshes in [1], but similar to what was obtained in [9, Figure 14] when not optimizing for Cartesian meshes. Unfortunately, performance measurements for non-Cartesian meshes were not reported in [1].

Figure 8: Roofline model for a selection of polynomial degrees. 'Peak DP' denotes the peak double precision performance (P), 'w/o FMA' represents a ceiling without Fused Multiply-Add, and 'w/o SIMD' is a ceiling where both FMA and SIMD vectorization are discarded. Panels: (a) 2D; (b) 3D.

When comparing the data for the "tensor4" caching strategy (Algorithm 3) to that of the "tensor2" (Algorithm 2) caching strategy, we observe a marginal increase in computational intensity and almost unchanged performance. This coincides with our expectation that the latter algorithm requires transferring fewer bytes to perform the operations on a single quadrature point (compare line 9 in Algorithm 3, which loads a symmetric rank-4 tensor, to line 9 in Algorithm 2, which evaluates the same contraction by loading two scalars). On the other hand, the "scalar" caching strategy (Algorithm 1) demonstrates a higher computational intensity. This is not surprising, as at each quadrature point we need to re-evaluate the Kirchhoff stress τ. Given that in this case we need to evaluate gradients with respect to both the deformed and the undeformed configuration, an additional second-order tensor per quadrature point (the inverse of the Jacobian of the transformation) will be cached as compared to Algorithm 2. As a result, both algorithms have comparable memory requirements (also confirmed by Figures 6(c), 6(d), 7(c) and 7(d)); however, the "scalar" algorithm performs a large number of repetitive operations that are independent of the right hand side vector (steps 8-11 in Algorithm 1), which makes it the worst option among the three.

In order to gain a better understanding of the behavior of the most promising "tensor4" implementation, we collect its performance data and computing times for the various stages inside the element loop (see Algorithm 3): (i) 'read/write' denotes reading and writing from/to a global vector (lines 2 and 13); (ii) 'sum factorization' denotes the application of sum factorization techniques (lines 4, 5 and 12); (iii) 'quadrature loop' denotes the operations performed at each quadrature point (lines 6-11). Given that the measurements are now done within the cell loop, we collected results on smaller meshes by adopting only one global refinement in 2D and no global mesh refinement in 3D.
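The 'sum factorization' stage can be sketched in isolation: for a tensor-product basis, interpolating element coefficients to quadrature points with the full 2D interpolation matrix (a Kronecker product) is equivalent to two small 1D contractions, at O(pq(p+q)) instead of O(p²q²) cost per element. The sizes below are arbitrary illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(0)
p, q = 4, 5                  # 1D basis functions / quadrature points (arbitrary)
N1 = rng.random((q, p))      # 1D shape values at the 1D quadrature points
u = rng.random((p, p))       # element coefficients in tensor-product (row-major) layout

# naive evaluation: assemble the full 2D interpolation matrix and apply it
N2d = np.kron(N1, N1)                    # (q*q) x (p*p) matrix
vals_naive = N2d @ u.reshape(-1)

# sum factorization: contract one direction at a time
vals_sf = N1 @ u @ N1.T                  # q x q values at the quadrature points

match = np.allclose(vals_sf.reshape(-1), vals_naive)
```

In 3D the same idea applies direction by direction, which is where the O(p^{d+1}) complexity of matrix-free operator evaluation comes from.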
(a) Roofline (2D); (b) Roofline (3D); (c) Computing times (2D); (d) Computing times (3D)
Figure 9: Breakdown of computing times and roofline analysis of the various steps of Algorithm 3.

It is clear from Figures 9(c) and 9(d) that the 'read/write' steps take a negligible amount of time on each element. For higher order bases, the majority of the time is spent within the quadrature loop. When analyzing the roofline data (Figures 9(a) and 9(b)), we observe that this part of the algorithm demonstrates rather low performance, which in our opinion is the source of the suboptimal performance of the whole matrix-free matrix-vector product (see Figure 8). In this case, the quadrature loop consists of, essentially, a single contraction between two second-order tensors (step 7 in Algorithm 3) and a double contraction between a symmetric second-order tensor and a fourth-order tensor (step 9 in Algorithm 3).

Tables 3 and 4 show the speed-up obtained due to explicit SIMD vectorization intrinsics and due to MPI. For this rather small number of MPI processes, the results confirm that the performance gains from vectorization and parallelization are independent of one another.
Table 3: Wall-clock time in seconds and performance in GFlop/s of Algorithm 3 in 2D for various combinations of polynomial degrees, vectorization and parallelization. For each polynomial degree p, the time and performance are listed for the serial, MPI-only, SIMD-only and MPI+SIMD variants, together with the speed-up relative to the serial case.
Table 4: Wall-clock time in seconds and performance in GFlop/s of Algorithm 3 in 3D for various combinations of polynomial degrees, vectorization and parallelization. For each polynomial degree p, the time and performance are listed for the serial, MPI-only, SIMD-only and MPI+SIMD variants, together with the speed-up relative to the serial case.

Furthermore, the improvement due to explicit intrinsics for the data used in the quadrature loop is comparable to the speed-up factors reported in [37, Figure 6].

Next, we evaluate the performance of the proposed geometric multigrid preconditioner. Our main goal is to examine whether or not the adopted level operators lead to an efficient preconditioner from the linear algebra perspective. To that end, we consider the average number of CG iterations throughout the entire simulation (i.e. over each Newton-Raphson iteration and each loading step). Figures 10(a), 10(b), 11(a) and 11(b) show a noticeable increase in the average number of CG iterations for different polynomial degrees for the GMG preconditioner. Although not reported in this section, we identified that the coarse level solver setup greatly affects the performance of the preconditioner. In particular, we can improve the number of iterations by choosing a more accurate coarse level solver; however, this leads to an overall more expensive linear solver. The settings used in this study are optimized for the considered problem in terms of the wall-time spent in the linear solver.

(a) CG iterations (2D); (b) CG iterations (3D); (c) CG solution time (2D); (d) CG solution time (3D)
Figure 10: Emmy cluster, RRZE, iterative solver.

The black-box AMG preconditioner demonstrates the same trend in terms of the number of CG iterations, but requires many more iterations for convergence.
Consequently, we can conclude that for the finite strain elasticity problem considered here, (i) the selected smoother is suitable, and (ii) although the adopted level operators are not built using the triple product A_l = I^l_{l+1} A_{l+1} I^{l+1}_l, they still result in a multigrid preconditioner which requires far fewer CG iterations than the algebraic multigrid.

Additionally, we compare the solution times of the preconditioned iterative solvers. Our goal is to determine whether the matrix-free approach with a GMG preconditioner can be faster than the matrix-based approach with a well-established implementation of an AMG preconditioner. We note specifically that the comparison is restricted to the choice of level operators used with the pre-existing GMG framework, and not the framework itself.

(a) CG iterations (2D); (b) CG iterations (3D); (c) CG solution time (2D); (d) CG solution time (3D)
Figure 11: IWR cluster.

Figures 10(c), 10(d), 11(c) and 11(d) confirm that the efficiency of the proposed GMG preconditioner translates into faster solution times for all polynomial degrees. Most importantly, the wall time per DoF for the matrix-free solver with the GMG preconditioner stays almost constant, both for the 2D and the 3D problems. In comparison, the wall time of the matrix-based iterative solver grows rapidly with increasing polynomial degree. Clearly this is related both to the performance of the preconditioner from the linear algebra perspective, as well as to the matrix-vector products studied in the previous section. Based on these results, we conclude that the matrix-free approach with the here-adopted geometric multigrid preconditioner can deliver a very competitive solution strategy for engineering problems with compressible hyperelastic finite strain material models.
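The way preconditioner quality shows up as CG iteration counts can be illustrated with a minimal preconditioned CG. Here a Jacobi preconditioner applied to a deliberately badly scaled SPD system stands in for the multigrid preconditioners; this is a toy stand-in, not the paper's solver stack:

```python
import numpy as np

def pcg(A, b, precond, tol=1e-8, maxit=5000):
    """Preconditioned conjugate gradients; returns solution and iteration count."""
    x = np.zeros_like(b)
    r = b.copy()
    z = precond(r)
    p = z.copy()
    rz = r @ z
    for it in range(1, maxit + 1):
        Ap = A @ p
        alpha = rz / (p @ Ap)
        x += alpha * p
        r -= alpha * Ap
        if np.linalg.norm(r) <= tol * np.linalg.norm(b):
            return x, it
        z = precond(r)
        rz_new = r @ z
        p = z + (rz_new / rz) * p
        rz = rz_new
    return x, maxit

rng = np.random.default_rng(0)
n = 100
L = 2.0 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)  # SPD model operator
s = 1.0 + 2.0 * rng.random(n)                           # bad row/column scaling
A = L * np.outer(s, s)                                  # A = diag(s) L diag(s)
b = A @ np.ones(n)                                      # exact solution is all ones

x_id, it_id = pcg(A, b, lambda r: r)                    # no preconditioner
diag = np.diag(A)
x_jac, it_jac = pcg(A, b, lambda r: r / diag)           # Jacobi preconditioner

fewer = it_jac < it_id
```

A stronger preconditioner (such as the GMG of this section) plays the role of the Jacobi sweep here: the better it approximates the inverse of the operator, the fewer CG iterations are required.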
Table 5: Weak scaling of Algorithm 3 in 3D for a quadratic polynomial basis. The columns list the number of cores, N_DoF, N_gref, the wall time per vmult [s] and the CG solver time per iteration [s/iteration]; the smallest configuration uses 20 cores with N_DoF = 4,442,880 and N_gref = 3.
Extension of this study from the node level to the cluster level is beyond the scope of this work. However, as a step in this direction, we have performed a preliminary study of the weak scalability of the implementation. Table 5 demonstrates good weak scaling of the wall time per matrix-vector product and per CG solver iteration up to 1280 processes for the 3D problem with a tri-quadratic basis. This confirms that the matrix-free and multigrid frameworks implemented in the deal.II library can provide a competitive solution for cluster-sized problems as well; compare also [3, Figure 7] for scaling results with up to 147,456 CPU cores and 34.4 billion degrees of freedom for an incompressible flow problem.
8. Summary and Conclusions
In this contribution, we proposed and numerically investigated three different matrix-free implementations of tangent operators for finite-strain elasticity with heterogeneous materials. To the best of our knowledge, this is the first work that applies matrix-free sum factorization techniques to the partial differential equations arising in this context. Among the three examined algorithms, the implementation that caches the material part of the fourth-order spatial tangent stiffness tensor together with the second-order Kirchhoff stress was shown to be faster than the matrix-based approach for polynomial degrees higher than one in 3D and higher than two in 2D. Given that we did not utilize the specific form of the material part of the stiffness tensor in this case, we conclude that the matrix-free implementation of the tangent operator can be applied to any constitutive model which provides the material part of the fourth-order spatial tangent stiffness tensor at the quadrature point level.

The roofline model indicates that the matrix-free finite-strain elasticity tangent operators have an order of magnitude higher algorithmic intensity than the matrix-based approach. We identified the operations performed in the quadrature loop as the main source of the remaining suboptimal performance. Future work includes combining the proposed level operators with homogenization, as well as studying the behavior of the method at the cluster level. (The 1280 processes used in the preliminary weak-scaling study correspond to 64 nodes, which is the maximum we could request on the RRZE Emmy HPC cluster.)

Acknowledgements
D. Davydov acknowledges the financial support of the German Research Foundation (DFG), grant DA 1664, "Multiphysics Computational Mechanics".

References

[1] M. Kronbichler, K. Kormann, A generic interface for parallel cell-based finite element operator application, Computers & Fluids 63 (2012) 135-147.
[2] D. A. May, J. Brown, L. Le Pourhiet, A scalable, matrix-free multigrid preconditioner for finite element discretizations of heterogeneous Stokes flow, Computer Methods in Applied Mechanics and Engineering 290 (2015) 496-523.
[3] B. Krank, N. Fehn, W. A. Wall, M. Kronbichler, A high-order semi-explicit discontinuous Galerkin solver for 3D incompressible flow with application to DNS and LES of turbulent channel flow, Journal of Computational Physics 348 (2017) 634-659.
[4] J. Brown, Efficient nonlinear solvers for nodal high-order finite elements in 3D, Journal of Scientific Computing 45 (1-3) (2010) 48-63.
[5] B. Gmeiner, M. Huber, L. John, U. Rüde, B. Wohlmuth, A quantitative performance study for Stokes solvers at the extreme scale, Journal of Computational Science 17 (2016) 509-521.
[6] A. Abdelfattah, M. Baboulin, V. Dobrev, J. Dongarra, C. Earl, J. Falcou, A. Haidar, I. Karlin, T. Kolev, I. Masliah, et al., High-performance tensor contractions for GPUs, Procedia Computer Science 80 (2016) 108-118.
[7] K. Ljungkvist, M. Kronbichler, Multigrid for matrix-free finite element computations on graphics processors, Technical Report 2017-006, Department of Information Technology, Uppsala University (2017).
[8] T. C. Clevenger, T. Heister, G. Kanschat, M. Kronbichler, A flexible, parallel, adaptive geometric multigrid method for FEM, in preparation.
[9] M. Kronbichler, K. Kormann, Fast matrix-free evaluation of discontinuous Galerkin finite element operators, arXiv preprint arXiv:1711.03590.
[10] S. Müthing, M. Piatkowski, P. Bastian, High-performance implementation of matrix-free high-order discontinuous Galerkin methods, arXiv preprint arXiv:1711.10885.
[11] G. Alzetta, D. Arndt, W. Bangerth, V. Boddu, B. Brands, D. Davydov, R. Gassmoeller, T. Heister, L. Heltai, K. Kormann, M. Kronbichler, M. Maier, J.-P. Pelteret, B. Turcksin, D. Wells, The deal.II library, version 9.0, Journal of Numerical Mathematics.
[12] P. Wriggers, Nonlinear Finite Element Methods, Springer Science & Business Media, 2008.
[13] L. R. G. Treloar, The Physics of Rubber Elasticity, Oxford University Press, USA, 1975.
[14] L. R. G. Treloar, H. G. Hopkins, R. S. Rivlin, J. M. Ball, The mechanics of rubber elasticity, Proceedings of the Royal Society A: Mathematical, Physical and Engineering Sciences 351 (1666) (1976) 301-330. doi:10.1098/rspa.1976.0144.
[15] R. Bartlett, I. Demeshko, T. Gamblin, G. Hammond, M. A. Heroux, J. Johnson, A. Klinvex, X. Li, L. C. McInnes, J. D. Moulton, D. Osei-Kuffuor, J. Sarich, B. Smith, J. Willenbring, U. M. Yang, xSDK foundations: Toward an extreme-scale scientific software development kit, Supercomputing Frontiers and Innovations 4 (1). doi:10.14529/jsfi170104.
[16] C. Burstedde, L. C. Wilcox, O. Ghattas, p4est: Scalable algorithms for parallel adaptive mesh refinement on forests of octrees, SIAM Journal on Scientific Computing 33 (3) (2011) 1103-1133.
[20] From h to p efficiently: Implementing finite and spectral/hp element methods to achieve optimal performance for low- and high-order discretisations, Journal of Computational Physics 229 (13) (2010) 5161-5181.
[21] J. H. Bramble, J. E. Pasciak, J. Xu, Parallel multilevel preconditioners, Mathematics of Computation 55 (191) (1990) 1-22.
[22] W. L. Briggs, V. E. Henson, S. F. McCormick, A Multigrid Tutorial, 2nd Edition, Society for Industrial and Applied Mathematics, Philadelphia, PA, USA, 2000.
[23] B. Janssen, G.
Kanschat, Adaptive multilevel methods with local smoothing for H1- and Hcurl-conforming high order finite element methods, SIAM Journal on Scientific Computing 33 (4) (2011) 2095-2114.
[...] doi:10.1016/j.cma.2009.11.018. URL http://eprints.gla.ac.uk/38423/
[34] R. S. Varga, Matrix Iterative Analysis, 2nd Edition, Springer, Berlin, 2009.
[35] D. Arndt, W. Bangerth, D. Davydov, T. Heister, L. Heltai, M. Kronbichler, M. Maier, J.-P. Pelteret, B. Turcksin, D. Wells, The deal.II library, version 8.5, Journal of Numerical Mathematics 25 (3). doi:10.1515/