Learning optimal multigrid smoothers via neural networks
Ru Huang∗   Ruipeng Li†   Yuanzhe Xi∗

Abstract
Multigrid methods are one of the most efficient techniques for solving linear systems arising from Partial Differential Equations (PDEs) and graph Laplacians from machine learning applications. One of the key components of multigrid is smoothing, which aims at reducing high-frequency errors on each grid level. However, finding optimal smoothing algorithms is problem-dependent and can impose challenges for many problems. In this paper, we propose an efficient adaptive framework for learning optimized smoothers from operator stencils in the form of convolutional neural networks (CNNs). The CNNs are trained on small-scale problems from a given type of PDEs based on a supervised loss function derived from multigrid convergence theories, and can be applied to large-scale problems of the same class of PDEs. Numerical results on anisotropic rotated Laplacian problems demonstrate improved convergence rates and solution time compared with classical hand-crafted relaxation methods.
1 Introduction

Partial Differential Equations (PDEs) play important roles in modeling various phenomena in many fields of science and engineering. Their solutions are typically computed numerically when closed-form solutions are not easily available, which leads to large-scale and ill-conditioned sparse linear systems to solve. In machine learning applications such as spectral clustering, graph-based semi-supervised learning and transportation network flows, solving large-scale linear systems associated with graph Laplacians is often needed. The development of efficient linear solvers is still an active research area nowadays.

Among many numerical solution schemes, multigrid methods can often show superior efficiency and scalability, especially for solving elliptic-type PDE and graph Laplacian problems [Brandt, 1984, Ruge, 1983, Briggs et al., 2000, Trottenberg et al., 2000]. Fast convergence of multigrid is achieved by exploiting hierarchical grid structures to eliminate errors of all modes by smoothing and coarse-grid correction at each grid level. Thus, the performance of multigrid methods highly depends on the smoothing property of the chosen smoother. However, the design of an optimal smoothing algorithm is problem-dependent and often too complex to be achieved even by domain experts. In this paper, we propose an adaptive framework for training optimized smoothers via convolutional neural networks (CNNs), which directly learns a mapping from operator stencils to the inverse of the smoothers. The training process is guided by multigrid convergence theories for good smoothing properties on eliminating high-frequency errors. Multigrid solvers equipped with the proposed smoothers inherit the convergence guarantees and scalability from standard multigrid and can show improved performance on anisotropic rotated Laplacian problems that are typically challenging for classical multigrid methods. Numerical results demonstrate that a well-trained CNN-based smoother can damp high-frequency errors more rapidly and thus lead to faster convergence of multigrid than traditional relaxation-based smoothers. Another appealing property of the proposed smoother and the training framework is the ability to generalize to problems of much larger sizes and more complex geometries, which is discussed in detail in the remainder of the paper.

∗ Department of Mathematics, Emory University, 201 Dowman Drive, Atlanta, GA 30322 ({ru.huang,yxi26}@emory.edu). YX is supported by NSF grant OAC 2003720.
† Center for Applied Scientific Computing, Lawrence Livermore National Laboratory, P. O. Box 808, L-561, Livermore, CA 94551 ([email protected]). This work was performed under the auspices of the U.S. Department of Energy by Lawrence Livermore National Laboratory under Contract DE-AC52-07NA27344 (LLNL-PROC-819678).

2 Related work
There has been increasing interest in leveraging machine learning techniques to solve large-scale linear systems in the past few years. Several researchers have proposed to use machine learning techniques to directly approximate the solutions of PDEs. For example, Lagaris et al. [1998] first proposed to use neural networks (NNs) to approximate the solutions for both Ordinary Differential Equations (ODEs) and PDEs with a fixed boundary condition. Later, Tang et al. [2017] utilized CNNs to solve Poisson equations with a simple geometry, and Berg and Nyström [2018] extended the techniques to more complex geometries. Han et al. [2018], Sirignano and Spiliopoulos [2018] applied machine learning techniques to solve high-dimensional PDEs, and Wei et al. [2019] focused on applying reinforcement learning to solve nonlinear PDEs. Sun et al. [2003] used parameterized realistic volume conduction models to solve Poisson equations, and Holl et al. [2020] trained a NN to plan optimal trajectories and control PDE dynamics, showing numerical results for solving incompressible Navier-Stokes equations.

Research has also been conducted on leveraging NNs to improve the performance of existing solvers. Schmitt et al. [2019] developed optimization techniques for geometric multigrid based on evolutionary computation. Mishra [2018] generalized existing numerical methods as NNs with a set of trainable parameters. Katrutsa et al. [2017] proposed a deep learning method to optimize the parameters of prolongation and restriction matrices in a two-grid geometric multigrid scheme by minimizing the spectral radius of the iteration matrix. Greenfeld et al. [2019] used NNs to learn prolongation matrices in multigrid in order to solve diffusion equations without retraining, and Luz et al. [2020] generalized this framework by using algebraic multigrid (AMG) to solve unstructured problems.

On the other hand, researchers have also explored relationships between CNNs and differential equations to design better NN architectures. For instance, He and Xu [2019] designed MgNet, which uses multigrid techniques to improve CNNs. Haber et al. [2018], Chang et al. [2017] scaled up CNNs by interpreting the forward propagation as nonlinear PDEs.

The work by Hsieh et al. [2019] uses CNNs and U-net [Ronneberger et al., 2015] to learn a correction term to the Jacobi method, which is used directly for solving Poisson equations. In this work, we adopt a similar idea to improve multigrid smoothers, since multigrid methods are known to be more scalable than Jacobi. To the best of our knowledge, our approach is the first attempt to use an individual CNN to learn the smoother at each level of multigrid with more than two levels, and it exhibits good generalization properties to problems with different sizes and geometries.
3 Background

3.1 Iterative methods

In this section we review the classical convergence theory for iterative methods. We consider solving the following linear system of equations
$$Au = f, \qquad (1)$$
where $A \in \mathbb{R}^{n \times n}$ is symmetric positive definite (SPD) and $u, f \in \mathbb{R}^n$. Iterative methods generate a sequence of improving approximations to the solution of (1), in which the approximate solution $u_k$ at iteration $k$ depends on the previous ones. Formally, an iterative solver can be expressed as
$$u_k = \Phi(u_0, f, k), \qquad (2)$$
where the solver $\Phi : \mathbb{R}^n \times \mathbb{R}^n \times \mathbb{Z} \to \mathbb{R}^n$ is an operator that takes the initial guess $u_0$ and the right-hand side vector $f$ and generates $u_k$ at iteration $k$. Iterations based on relaxation schemes can be written as
$$u_{k+1} = (I - M^{-1}A)u_k + M^{-1}f = Gu_k + M^{-1}f, \qquad G = I - M^{-1}A, \qquad (3)$$
where $M$ is the relaxation matrix and $G$ is the iteration matrix. Standard relaxation approaches include the weighted Jacobi method with $M = \omega^{-1}D$, where $D$ denotes the diagonal of $A$, and the Gauss-Seidel method with $M = D - L$, where $-L$ is the strict lower triangular part of $A$. Denoting by $e_k = u_* - u_k$ the error at iteration $k$, where $u_*$ is the exact solution of (1), it follows that $e_k = G^k e_0$. The following theorem gives a general convergence result for $\lim_{k\to\infty} e_k = 0$.
Theorem 3.1 (Saad 2003, Theorem 4.1). Denote by $\rho(G)$ the spectral radius of $G$. The iteration (3) converges for any initial vector $u_0$ if and only if $\rho(G) < 1$.

Notice that $\rho(G)$ represents the asymptotic convergence rate, which, however, does not, in general, predict the error reduction in a few iterations [Briggs et al., 2000]. Since our interest is to use relaxation as multigrid smoothers, which are typically applied $O(1)$ times in each smoothing step, we have to consider convergent smoothers defined as follows.
Definition 3.1 (Convergent smoother in energy norm). Assuming $A$ is SPD, the relaxation matrix $M$ is called a convergent smoother in the energy norm if $\|Ge_k\|_A < \|e_k\|_A$ for all $e_k \neq 0$, where $G = I - M^{-1}A$ and $\|x\|_A = (x^T A x)^{1/2}$.

It can be shown that $M$ is a convergent smoother if and only if $\|G\|_A < 1$, which holds if and only if $M^T + M - A$ is SPD (see Appendix A). Since $\rho(G)$ is easier to compute than $\|G\|_A$ and $\rho(G) \le \|G\|_A$, $\rho(G)$ is often used as a metric of the convergence rate of smoothers.

Though relaxation schemes can have very slow convergence when being used as a solver, they are known to be very efficient for smoothing the error. That is, after a few iterations, the remaining error varies slowly relative to the mesh grid, and thus can be approximated well on a coarser grid. This property is exploited in multigrid methods, as discussed in the next section.
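To make the role of $M$ concrete, the following minimal NumPy sketch (our illustration, not code from the paper) applies a few weighted-Jacobi sweeps, $u \leftarrow u + \omega D^{-1}(f - Au)$, to a 1-D Poisson matrix and prints the energy norm of the error:

```python
# Minimal sketch (not from the paper): weighted Jacobi as a relaxation
# scheme for the 1-D Poisson matrix; illustrates iteration (3).
import numpy as np

n = 63
A = 2*np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)   # 1-D Poisson stencil [-1, 2, -1]
u_exact = np.random.rand(n)
f = A @ u_exact

omega = 2.0 / 3.0                        # standard damping factor; M = D / omega
D_inv = 1.0 / np.diag(A)

u = np.zeros(n)
for k in range(5):
    u = u + omega * D_inv * (f - A @ u)  # u_{k+1} = u_k + M^{-1}(f - A u_k)
    e = u_exact - u
    print(k, np.sqrt(e @ A @ e))         # ||e_k||_A decreases, but slowly
```

The error norm decreases only slowly, but after a few sweeps the remaining error is smooth: exactly the behavior that coarse-grid correction is designed to complement.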
3.2 Multigrid methods

Multigrid methods exploit a hierarchy of grids with exponentially decreasing numbers of degrees of freedom on coarser levels, starting with the original problem on the finest level. On each level, the computational cost is proportional to the problem size; therefore, the overall complexity is still linear. Smoothing and coarse-grid correction are the two main components of multigrid, which are designed to be complementary to each other in order to achieve fast convergence, i.e., they aim at eliminating "high-frequency" (oscillatory) and "low-frequency" (smooth) errors respectively, where high- and low-frequency errors usually correspond to eigenvectors of $M^{-1}A$ with large and small eigenvalues. Relaxation-based approaches such as (point-wise) Jacobi and Gauss-Seidel are typical choices of multigrid smoothers, as these methods are inexpensive to apply and can effectively remove high-frequency errors for elliptic-type PDEs. On the other hand, the effectiveness of coarse-grid correction on low-frequency errors is due to the fact that smooth errors can be interpolated accurately.

When dealing with hard problems such as ones with irregular anisotropy, anisotropy not aligned along the coordinate axes, or complex geometries, the efficiency of traditional smoothers can deteriorate, in which case stronger and often more expensive smoothers are needed, such as block smoothers [Evans and Yousif, 1990, Bolten and Kahl, 2012], ILU-based smoothers [Wittum, 1989] and smoothers based on Krylov methods [Bank and Douglas, 1985, Lin et al., 2020]. Nevertheless, finding robust and efficient smoothers still remains a challenging problem for multigrid.

Convergence theory of two-grid methods has been well studied [Brandt, 1986, Brezina et al., 2001, Falgout and Vassilevski, 2004, Xu and Zikatanov, 2017] through the error propagation operator $E_{TG}$ of the form
$$E_{TG} = (I - M^{-1}A)\left(I - P(P^T A P)^{-1} P^T A\right), \qquad (4)$$
where $M$ is the smoother, $P \in \mathbb{R}^{n \times n_c}$ is the prolongation operator, $P^T$ is typically used as the restriction operator for symmetric problems, and $P^T A P$ is the Galerkin coarse-grid operator. In general, a smaller $\|E_{TG}\|_A$ indicates faster convergence for two-grid methods.

In this paper we choose standard prolongation operators $P$ and only focus on using CNNs to parameterize $M$. The following theorem summarizes the main convergence result in [Falgout and Vassilevski, 2004] with respect to $M$ and $P$.

Theorem 3.2 (Falgout and Vassilevski 2004). Assuming $M^T + M - A$ is SPD, denote by
$$\tilde{M} = M^T (M^T + M - A)^{-1} M \qquad (5)$$
the symmetrized smoother. Let $R \in \mathbb{R}^{n_c \times n}$ be any matrix such that $RP = I$, and let
$$K = \max_{e \neq 0} \frac{\|(I - PR)e\|_{\tilde{M}}^2}{\|e\|_A^2}. \qquad (6)$$
Then $K \ge 1$ and $\|E_{TG}\|_A \le (1 - 1/K)^{1/2}$.

The quantity $K$ in (6), which is the so-called weak approximation property [Brandt et al., 1985], essentially measures how accurately interpolation approximates the eigenvectors of $M^{-1}A$ proportionally to the corresponding eigenvalues. The optimal $K$ yields an ideal uniform bound on the convergence rate, which can be computed explicitly as follows.
Definition 3.2 (Ideal uniform bound of convergence rate). Suppose $P$ takes the form $P = \begin{pmatrix} W \\ I \end{pmatrix}$ as in standard multigrid algorithms, and let $R = \begin{pmatrix} 0 & I \end{pmatrix}$ and $S^T = \begin{pmatrix} I & 0 \end{pmatrix}$. Denoting by $K_*$ the minimum of $K$ in (6) over $P$, we define the quantity $\beta_*$ such that
$$\beta_* = 1 - 1/K_* = 1 - \lambda_{\min}\left((S^T \tilde{M} S)^{-1}(S^T A S)\right), \qquad (7)$$
which can be considered as the ideal uniform bound of the convergence rate [Falgout and Vassilevski, 2004]. This quantity is often used to analyze the convergence rate of smoothers in two-grid methods [Baker et al., 2011].
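As a concrete illustration (our sketch, not code from the paper), both $\|E_{TG}\|_A$ and $\beta_*$ can be evaluated directly on a small 1-D Poisson problem with linear interpolation, using $\|E\|_A = \|A^{1/2} E A^{-1/2}\|_2$:

```python
# Sketch (illustrative only): evaluate ||E_TG||_A from (4) and beta_* from (7)
# for 1-D Poisson with weighted Jacobi and linear interpolation.
import numpy as np
from scipy.linalg import sqrtm, eigvals, inv, norm

n = 15                                    # fine grid; n_c = 7 coarse points
A = 2*np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)
M = np.diag(np.diag(A)) / (2.0/3.0)       # weighted Jacobi, M = D / omega

nc = (n - 1)//2
P = np.zeros((n, nc))                     # linear interpolation stencil [1/2, 1, 1/2]
for j in range(nc):
    i = 2*j + 1                           # coarse points sit at odd fine indices
    P[i-1, j], P[i, j], P[i+1, j] = 0.5, 1.0, 0.5

Ac = P.T @ A @ P                          # Galerkin coarse-grid operator
E = (np.eye(n) - inv(M) @ A) @ (np.eye(n) - P @ inv(Ac) @ P.T @ A)
A_half = sqrtm(A)
print("||E_TG||_A =", norm(A_half @ E @ inv(A_half), 2))

# Ideal bound (7): S picks the fine-only (F) points, here the even indices.
Mt = M.T @ inv(M.T + M - A) @ M           # symmetrized smoother (5)
F = [i for i in range(n) if i % 2 == 0]
S = np.zeros((n, len(F)))
S[F, range(len(F))] = 1.0
lam_min = min(eigvals(inv(S.T @ Mt @ S) @ (S.T @ A @ S)).real)
print("beta_* =", 1.0 - lam_min)
```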
Extension from two-grid methods to multigrid methods is straightforward: two-grid methods are applied recursively on the coarse-grid system; see Algorithm 1 for a brief description of the standard multigrid V-cycle. Notice that the smoother $M^{(l)}$ at level $l$ is only required to eliminate errors that are $A^{(l)}$-orthogonal to $\mathrm{Ran}(P^{(l)})$ in order to achieve fast convergence. This property will be used to design efficient training strategies for learning neural smoothers in the next section.

Algorithm 1: Multigrid V-cycle for solving $Au = f$
1: Pre-smoothing: $u^{(l)} = u^{(l)} + (M^{(l)})^{-1}(f^{(l)} - A^{(l)} u^{(l)})$
2: Compute the fine-level residual $r^{(l)} = f^{(l)} - A^{(l)} u^{(l)}$, and restrict it to the coarse level: $r^{(l+1)} = (P^{(l)})^T r^{(l)}$
3: if $l + 1$ is the last level then
4:     Solve $A^{(l+1)} u^{(l+1)} = r^{(l+1)}$
5: else
6:     Call the multigrid V-cycle recursively with $l = l + 1$, $f^{(l+1)} = r^{(l+1)}$ and $u^{(l+1)} = 0$
7: end if
8: Prolongate the coarse-level approximation and correct the fine-level approximation: $u^{(l)} = u^{(l)} + P^{(l)} u^{(l+1)}$
9: Post-smoothing: $u^{(l)} = u^{(l)} + (M^{(l)})^{-1}(f^{(l)} - A^{(l)} u^{(l)})$
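The recursion in Algorithm 1 translates almost line by line into code. Below is a minimal, self-contained NumPy sketch (our illustration, with dense matrices; the per-level operators `A[l]` and interpolations `P[l]` are assumed precomputed, and a generic `smooth` callback stands in for the action of $(M^{(l)})^{-1}$):

```python
# Sketch of Algorithm 1 with dense NumPy matrices (illustrative only).
# smooth(l, u, f) applies one smoothing step u + (M^(l))^{-1} (f - A^(l) u).
import numpy as np

def v_cycle(l, u, f, A, P, smooth):
    u = smooth(l, u, f)                          # line 1: pre-smoothing
    r = f - A[l] @ u                             # line 2: residual
    rc = P[l].T @ r                              #         restriction
    if l + 1 == len(A) - 1:                      # line 3: coarsest level reached
        uc = np.linalg.solve(A[l + 1], rc)       # line 4: direct solve
    else:
        uc = v_cycle(l + 1, np.zeros_like(rc), rc, A, P, smooth)  # line 6
    u = u + P[l] @ uc                            # line 8: coarse-grid correction
    return smooth(l, u, f)                       # line 9: post-smoothing
```

With `smooth` chosen as weighted Jacobi this is a classical V-cycle; replacing it with a learned map, e.g. `u + H[l](f - A[l] @ u)`, gives the neural-smoother variant studied in the rest of the paper.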
The convergence of the multigrid V-cycle heavily depends on the choice of smoothers. Classical off-the-shelf smoothers such as weighted Jacobi or Gauss-Seidel exhibit near-optimal performance on simple Poisson equations but generally lose their efficiency on other types of PDEs. In this section, we formulate the design of smoothers as a learning task and train a single neural network to parameterize the action of the inverse of the smoother at a given grid level. The learned smoothers are represented as a sequence of convolutional layers and trained in an adaptive way guided by the multigrid convergence theory.

4.1 Formulation

We define a PDE problem as the combination of a PDE class $\mathcal{A}$, forcing term $\mathcal{F}$ and boundary condition $\mathcal{G}$. To solve the problem numerically on a 2-D square domain, we discretize it on a grid of size $N \times N$, which leads to solving a linear system $Au = f$ where $A \in \mathbb{R}^{N^2 \times N^2}$ and $f \in \mathbb{R}^{N^2}$. Our goal is to train smoothers $M^{(0)}, \ldots, M^{(L-1)}$ on the first $L$ levels of a multigrid solver that has $L + 1$ levels. We assume here that the multigrid solver uses the same smoother for both the pre-smoothing and post-smoothing steps (cf. lines 1 and 9 in Algorithm 1, respectively), and uses a direct method as the coarsest-level solver. Denoting by $\Phi^{(0)}$ the multigrid hierarchy from level 0, the training objective for $\Phi^{(0)}$ is to minimize the error
$$\|\Phi^{(0)}(u_0, f, k) - u_*\|^2, \qquad (8)$$
where $u_0$ is a given initial guess, $u_*$ is the exact solution, and $u_k = \Phi^{(0)}(u_0, f, k)$ is the approximate solution obtained by performing $k$ steps of V-cycles with $\Phi^{(0)}$.

The advantage of minimizing (8) instead of the norm of the associated iteration matrix is that (8) can be evaluated and optimized more efficiently. When multiple exact solutions are used to minimize (8) jointly, the convergence property of the trained smoother can be justified by the following theorem, which shows that when the loss (8) is small, the norm of the associated iteration matrix can also be small.
Theorem 4.1 (Gudmundsson et al. 1995). Assume $A \in \mathbb{R}^{n \times n}$ and $z \in \mathbb{R}^n$ is uniformly distributed on the unit $n$-sphere. Then $\mathbb{E}\left(n \|Az\|^2\right) = \|A\|_F^2$.
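Theorem 4.1 is easy to check numerically; the following small script (ours, for illustration) draws random unit vectors and compares the sample mean of $n\|Az\|^2$ with $\|A\|_F^2$:

```python
# Sketch: Monte-Carlo check of E(n ||Az||^2) = ||A||_F^2 (Theorem 4.1).
import numpy as np

rng = np.random.default_rng(0)
n = 50
A = rng.standard_normal((n, n))

z = rng.standard_normal((100000, n))
z /= np.linalg.norm(z, axis=1, keepdims=True)      # uniform on the unit sphere
estimate = n * np.mean(np.linalg.norm(z @ A.T, axis=1) ** 2)
print(estimate, np.linalg.norm(A, "fro") ** 2)     # the two numbers agree closely
```

This is why driving the sampled loss (8) down over many right-hand sides also drives down the Frobenius norm of the error-propagation operator.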
In this paper, we fix $\mathcal{A}$ but vary $\mathcal{F}$ and $\mathcal{G}$, and learn multigrid smoothers that are appropriate for different PDEs from the same class. Specifically, we train the multigrid solvers on a small set of discretized problems
$$\mathcal{D} = \bigcup_{j=1}^{Q} \{A, f_j, (u_0)_j, (u_*)_j\} \qquad (9)$$
with the presumption that the learned smoothers have good generalization properties: the learned smoothers can perform well on problems with much larger grid sizes and different geometries.

In particular, we parameterize the action of $(M^{(l)})^{-1}$ by a neural network, $H^{(l)}$, with only convolutional layers. This parameterization has several advantages. First, on an $N \times N$ grid, $H^{(l)}$ only requires $O(N^2)$ computation and has few parameters. Second, $H^{(l)}$ can be readily applied to problems defined on different grid sizes or geometries. Lastly, and most importantly, the following theorem justifies the use of this parameterization to construct convergent smoothers.

Theorem 4.2. Suppose $H$ is a CNN with $k$ convolutional layers. For a fixed matrix $A$, if parameterized properly, $H$ can be a convergent smoother.

Proof. See Appendix B.
4.2 Training strategies

Several strategies for training multigrid solvers using CNNs as smoothers have been considered in this work. We discuss the algorithms and summarize their advantages and disadvantages in this section.

The first training strategy is to train $H^{(l)}$ separately for each multigrid level $l = 0, \ldots, L-1$, where we construct a training set $\mathcal{D}^{(l)}$ similar to (9) for the operator $A^{(l)}$. That is, we train $H^{(l)}$ to make iteration (3) convergent by minimizing the error between the approximate solution obtained at iteration $k$ and the ground-truth solution. As suggested in [Hsieh et al., 2019], we also choose different iteration numbers $k$, $1 \le k \le b$, in the training, so that $H^{(l)}$ learns to converge at each iteration, where a larger $b$ mimics the behavior of solving problems to higher accuracy while a smaller $b$ mimics inexpensive smoothing steps in multigrid. This training strategy is simple, and the trainings on different levels are totally independent. However, we found that the obtained $H^{(l)}$ usually does not exhibit a good smoothing property of reducing high-frequency errors, especially when $H^{(l)}$ is a shallow neural network. This phenomenon is expected since the training strategy does not consider the underlying coarse-grid hierarchy and tries to reduce errors over the whole spectrum of $A^{(l)}$.
Algorithm 2: Adaptive training of multigrid CNN smoothers

Input: Multigrid hierarchy: number of multigrid levels $L + 1$; coarsest-grid solver at level $L$, namely $\Psi^{(L)}$; coefficient matrices $A^{(l)}$, where $A^{(0)} = A$, and interpolation operators $P^{(l)}$, for $l = 0, \ldots, L-1$; size of the training set $Q$; maximum allowed number of smoothing steps $b$.
Output: Smoothers $H^{(0)}, \ldots, H^{(L-1)}$

for $l = L-1, \ldots, 0$ do
    Construct the training set
    $$\mathcal{D}^{(l)} = \bigcup_{j=1}^{Q} \{t_j^{(l)}\}, \qquad t_j^{(l)} = \{A^{(l)}, f_j^{(l)}, (u_0^{(l)})_j, (u_*^{(l)})_j\}$$
    Initialize the weights of $H^{(l)}$
    Perform stochastic gradient descent (SGD) to minimize the loss function
    $$\sum_{t_j^{(l)} \in \mathcal{D}^{(l)},\ k \sim U(1, b)} \|\Phi^{(l)}((u_0^{(l)})_j, f_j^{(l)}, k) - (u_*^{(l)})_j\|^2$$
    With $\Phi^{(l)}(u_0, f, 0) \equiv u_0$, run the forward propagation by
    $$\Phi^{(l)}(u_0, f, k) = \Phi^{(l)}(u_0, f, k-1) + \Psi^{(l)}(r_{k-1}), \qquad r_{k-1} := f - A^{(l)} \Phi^{(l)}(u_0, f, k-1),$$
    $$\Psi^{(l)}(r_{k-1}) = t_{k-1} + H^{(l)}(r_{k-1} - A^{(l)} t_{k-1}), \qquad t_{k-1} := H^{(l)}(r_{k-1}) + P^{(l)} \Psi^{(l+1)}((P^{(l)})^T s_{k-1}), \qquad s_{k-1} := r_{k-1} - A^{(l)} H^{(l)}(r_{k-1}),$$
    and update $H^{(l)}$ by back propagation
end for
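The forward propagation in Algorithm 2 is itself a small recursion: $\Psi^{(l)}$ is a residual-based two-grid step whose coarse solve is the already-trained hierarchy $\Psi^{(l+1)}$. The following sketch (ours, not the authors' implementation; `H[l]` stands for the CNN acting on residual vectors) makes that structure explicit:

```python
# Sketch (illustrative): one application of Psi^(l) from Algorithm 2.
# A[l], P[l] per level; coarsest_solve solves with A[L] at the last level,
# e.g., coarsest_solve = lambda rc: np.linalg.solve(A[L], rc).
def psi(l, r, A, P, H, coarsest_solve, L):
    t = H[l](r)                                 # smoothing applied to the residual
    s = r - A[l] @ t                            # s_{k-1} = r - A^(l) H^(l)(r)
    sc = P[l].T @ s                             # restrict
    if l + 1 == L:
        ec = coarsest_solve(sc)                 # direct solve at the last level
    else:
        ec = psi(l + 1, sc, A, P, H, coarsest_solve, L)
    t = t + P[l] @ ec                           # t_{k-1} with coarse correction
    return t + H[l](r - A[l] @ t)               # post-smoothing on the residual

# One outer iteration: Phi(u0, f, k) = Phi(u0, f, k-1) + psi(l, residual, ...).
```

In the actual training, `H[l]` is the convolutional network, the whole expression is differentiated end-to-end by automatic differentiation, and the coarser-level networks `H[j]`, $j > l$, are kept as already trained (cf. Figure 1).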
In contrast, a well-trained $H^{(l)}$ with high complexity, deeper in the layers and larger in the convolutional kernels, can approximate the action of the inverse of $A^{(l)}$ well, but using it as a smoother is nonetheless not efficient, and, moreover, the training cost will be significantly higher.

A second training approach is to optimize the objective function (8) directly over $M^{(l)}$ at all levels, $l = 0, \ldots, L-1$. This approach targets the convergence of the overall multigrid V-cycle and considers both the smoothing and the coarse-grid correction. However, training the CNNs at all levels together turns out to be prohibitively expensive.

Finally, we propose an efficient adaptive training strategy that imposes the smoothing property by recursively training the smoothing CNN at a fine level. The training process starts from the second coarsest level and is repeatedly applied to the finer levels, given that the smoothers at the coarser levels have already been trained, so that the solve with the coarse-grid operator can be replaced with a V-cycle using the available multigrid hierarchy one level down. The adaptive training algorithm is sketched in Algorithm 2.

Figure 1 illustrates the procedure of adaptively training a 5-level multigrid solver in 4 stages, starting at level 3. The loss is given by $\mathcal{L}^{(3)} = \sum_{j,k} \|\Phi^{(3)}((u_0^{(3)})_j, (f^{(3)})_j, k) - (u_*^{(3)})_j\|^2$, where $\Phi^{(3)}$ represents the two-level multigrid with levels 3 and 4. In the second stage, the training proceeds at level 2 for the CNN $H^{(2)}$, utilizing the underlying 2-level hierarchy obtained from the first stage. This procedure continues until $H^{(0)}$ is computed at the finest level and the entire training is completed, so that the resulting multigrid hierarchy $\Phi^{(0)}$ can be used for solving systems of equations with $A^{(0)} \equiv A$.

Figure 1: Proposed adaptive training strategy for a 5-grid method: $H^{(3)}, H^{(2)}, H^{(1)}, H^{(0)}$ are trained in stages 1-4, with a direct solver at level 4. The training starts from level 3 and proceeds upward to level 0. When $H^{(l)}$ is being trained, the lower-level $H^{(j)}$ for $j = l+1, \ldots$ are kept as already trained.

Another appealing property of the proposed training approach is the updatability of the smoothers. The trained smoothers can be updated in another training process by injecting errors that cannot be effectively reduced by the current multigrid solver back into the training set. Specifically, to improve the smoothers in a trained multigrid solver $\Phi^{(0)}$, we can first apply $\Phi^{(0)}$ to the homogeneous equation $Au = 0$ for $k$ steps with a random initial vector $u_0$ and get the approximate solution $u_k$, i.e., $u_k = \Phi^{(0)}(u_0, 0, k)$, then add $r_k^{(l)} = (P^{(l-1)})^T r_k^{(l-1)}$ with $r_k^{(0)} = -Au_k$ to the training set at each level $l$, and finally re-train $\Phi^{(0)}$ as before with the new augmented training sets, using the existing $H^{(l)}$ in the multigrid hierarchy as the initial values.

5 Numerical results

In this paper we consider the following two-dimensional anisotropic rotated Laplacian problem:
$$-\nabla \cdot (T \nabla u(x, y)) = f(x, y), \qquad (10)$$
where the $2 \times 2$ matrix $T$ is defined as
$$T = \begin{pmatrix} \cos^2\theta + \xi \sin^2\theta & (1 - \xi)\cos\theta \sin\theta \\ (1 - \xi)\cos\theta \sin\theta & \sin^2\theta + \xi \cos^2\theta \end{pmatrix}, \qquad (11)$$
with $\theta$ being the angle of the anisotropy and $\xi$ being the conductivity. We discretize the operators $\Delta u$ and $u_{xy}$ using standard five-point stencils (see Appendix C).

We use multigrid V-cycles to solve the resulting discretized linear system $Au = f$, where the coefficient matrix $A$ is parameterized by $(\theta, \xi, n, G)$. Here $n$ is the grid size and $G$ is the geometry of the grid. We show the robustness and efficiency of the proposed neural smoothers on a variety of sets of parameters $(\theta, \xi, n, G)$. For each set of parameters, we train the neural smoothers on a dataset constructed on square domains with a small grid size, and show that the trained neural smoothers can outperform standard ones such as weighted Jacobi. Furthermore, we demonstrate that the trained neural smoothers can be applied to solve much larger problems and problems with more complex geometries without retraining.

Since the focus of this work is on smoothers, we adopt standard algorithms for multigrid coarsening and grid-transfer operators. Specifically, we consider full coarsening, which is illustrated in Figure 2 for 2-D grids, where grid points are coarsened in both the $x$- and $y$-dimensions. The associated restriction and interpolation are full weighting, a weighted average over the $3 \times 3$ neighborhood, with stencils
$$\frac{1}{16}\begin{pmatrix} 1 & 2 & 1 \\ 2 & 4 & 2 \\ 1 & 2 & 1 \end{pmatrix} \quad \text{and} \quad \frac{1}{4}\begin{pmatrix} 1 & 2 & 1 \\ 2 & 4 & 2 \\ 1 & 2 & 1 \end{pmatrix}.$$

Figure 2: Full coarsening of a 2D grid. Fine points are red and coarse points are black. Full weighting restriction is shown by the arrows to the coarse point at the center.

We also consider red-black coarsening, which has a coarsening factor of about 2, shown in Figure 3 for the first 3 levels. Note that the coarsening on level 0 is essentially a semi-coarsening along the 45° angle, and on level 1 the coarsening is performed on the 45°-rotated meshes, which generates the grid on level 2 that amounts to a semi-coarsening along the $y$-dimension. The restriction and interpolation stencils associated with this coarsening are given by
$$\frac{1}{8}\begin{pmatrix} 0 & 1 & 0 \\ 1 & 4 & 1 \\ 0 & 1 & 0 \end{pmatrix} \quad \text{and} \quad \frac{1}{4}\begin{pmatrix} 0 & 1 & 0 \\ 1 & 4 & 1 \\ 0 & 1 & 0 \end{pmatrix}.$$

Figure 3: Red-black coarsening of a 2D grid for levels 0, 1 and 2, where red points are fine points and black points are coarse points.
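Both grid-transfer operators above are small stencils, so they can be applied as convolutions; for example, full-weighting restriction is the $3 \times 3$ weighted average followed by stride-2 subsampling. A small sketch of ours using SciPy:

```python
# Sketch: full-weighting restriction applied as a stencil convolution
# followed by stride-2 subsampling (illustrative, square interior grid).
import numpy as np
from scipy.signal import convolve2d

full_weighting = np.array([[1, 2, 1],
                           [2, 4, 2],
                           [1, 2, 1]]) / 16.0

def restrict(r_fine):
    """Restrict a residual on a (2m+1) x (2m+1) grid to the m x m coarse grid."""
    smoothed = convolve2d(r_fine, full_weighting, mode="same")
    return smoothed[1::2, 1::2]          # keep every other point (coarse points)

r = np.random.rand(15, 15)               # interior points of a fine mesh
print(restrict(r).shape)                 # -> (7, 7)
```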
To evaluate our method, we compare the performance of multigrid methods using Algorithm 1 equipped with convolutional neural smoothers that are trained adaptively (denoted by α-CNN), convolutional neural smoothers trained independently (denoted by CNN) and the weighted Jacobi smoother (denoted by ω-Jacobi) for solving a variety of linear systems. These problems are generated by varying the parameters $(\xi, \theta, n, G)$.

Training details.
First, we train smoothers independently using the first strategy discussed in Section 4.2. For each smoother, we construct 50 problem instances of size $16^2$. Then, we use the adaptive training framework to train smoothers using Algorithm 2. The training process for a 5-level multigrid has 4 stages. At each stage we construct a training data set which contains 50 instances of the problem on the corresponding level. All stages have the same size of the coarsest grid. In particular, under the full coarsening scheme, at stage $l$ the problems are constructed on the $(4-l)$-th level and have grid size $(2^{l+2} - 1)^2$. Under the red-black coarsening scheme, at stages 1 and 2 the problem instances have size $9^2$, and at stages 3 and 4 the problems have size $17^2$. This is because when we apply red-black coarsening to a regular grid, the grid becomes irregular; therefore we pad the irregular grid with zeros so that we can apply CNNs more efficiently.

Neural networks. We use CNNs to approximate the action of the inverse of the smoother. In particular, under the full coarsening scheme, for both the CNN and α-CNN smoothers, $H^{(l)}$ is parameterized as
$$H^{(l)} = f_5^{(l)}(f_4^{(l)}(\cdots(f_2^{(l)}(f_1^{(l)}))\cdots)) + f_6^{(l)}, \qquad (12)$$
where each $f_i^{(l)}$ is parameterized by a $3 \times 3$ convolutional kernel $\phi_i^{(l)}$. We initialize the weights of $\phi_1^{(l)}, \ldots, \phi_5^{(l)}$ with zeros and the weights of $\phi_6^{(l)}$ to be the inverse Jacobi stencil, so that $H^{(l)}$ is initialized as Jacobi. For red-black coarsening, $H^{(l)}$ is parameterized as
$$H^{(l)} = f_2^{(l)}(f_1^{(l)}). \qquad (13)$$
Note that we could use more convolutional layers, and for each grid point we could also explore a larger neighborhood, which can typically lead to a faster convergence rate at the price of more computational cost per iteration. The current settings were found to give the best trade-off between convergence rate and time-to-solution.
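In PyTorch terms, (12) is five chained $3 \times 3$ convolutions plus a parallel single $3 \times 3$ convolution branch. A minimal sketch of this architecture (our reading of (12), with single input/output channels, zero padding, and no nonlinearities assumed; the Jacobi initialization value is an assumption for illustration):

```python
# Sketch of the smoother architecture (12) (our reading; not the authors' code).
import torch
import torch.nn as nn

class Smoother(nn.Module):
    def __init__(self):
        super().__init__()
        self.chain = nn.ModuleList(
            [nn.Conv2d(1, 1, 3, padding=1, bias=False) for _ in range(5)])
        self.f6 = nn.Conv2d(1, 1, 3, padding=1, bias=False)
        for conv in self.chain:
            nn.init.zeros_(conv.weight)        # phi_1..phi_5 start at zero
        # phi_6 starts as a Jacobi stencil so that H is initialized as Jacobi;
        # for a 5-point Laplacian with diagonal 4, D^{-1} is 0.25 at the center
        # (assumed value, for illustration).
        with torch.no_grad():
            self.f6.weight.zero_()
            self.f6.weight[0, 0, 1, 1] = 0.25

    def forward(self, r):
        y = r
        for conv in self.chain:
            y = conv(y)                         # f_5(f_4(...f_1(r)...))
        return y + self.f6(r)                   # + f_6(r)

H = Smoother()
print(H(torch.randn(1, 1, 16, 16)).shape)       # preserves the grid shape
```

Because all layers are convolutions, the same `Smoother` applies unchanged to any grid size, which is what enables the generalization experiments below.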
Evaluation metrics. We train the smoothers on problems with small grid sizes where the ground truth can be easily obtained. When we test on large-scale problems, it is time-consuming to obtain the ground truth. Therefore, when we evaluate the performance, we use the relative residual
$$\frac{\|f - A\hat{u}\|}{\|f\|} < 10^{-6}$$
as the stopping criterion, which avoids the need for exact solutions. We compare both the number of iterations and the run time for multigrid solvers using different smoothers to reach the same accuracy. To reduce the effect of randomness, for each test problem, we run the multigrid solvers to solve 10 problems with different random right-hand sides and present the averaged numbers.
Convergence rate. Since coarser problems are usually better conditioned, the smoothers on the finest level have the biggest impact on the overall convergence. In this experiment we compare the spectral properties of the smoothers on the finest level. We first compare the spectral radius of the iteration matrices (3) constructed from the ω-Jacobi smoothers and the α-CNN smoothers and summarize the results in Table 1. These statistics are calculated on two sets of test problems defined on a 16 × 16 grid. In the first set, θ is fixed at 0 and ξ = 100, 200, 300, 400; in the second set, ξ is fixed at 100 and θ takes four increasing values starting from 0 (listed in the table). The corresponding comparison of the ideal convergence bounds (7) on these tests is provided in Table 2.

θ = 0:
             ξ = 100   ξ = 200   ξ = 300   ξ = 400
  ω-Jacobi   0.9886    0.9886    0.9886    0.9886
  α-CNN      0.7660    0.8060    0.8588    0.7883

ξ = 100:
             θ = 0     θ = π/·   θ = π/·   θ = π/·
  ω-Jacobi   0.9886    0.9913    0.9934    0.9942
  α-CNN      0.7660    0.7743    0.9652    0.9728

Table 1: Spectral radius of the iteration matrices (3) of two-grid methods using full coarsening with ω-Jacobi and 6-layered α-CNN smoothers for rotated Laplacian problems with different θ and ξ. The grid size is 16 × 16.

θ = 0:
             ξ = 100   ξ = 200   ξ = 300   ξ = 400
  ω-Jacobi   0.9886    0.9886    0.9886    0.9886
  α-CNN      0.7660    0.8060    0.8588    0.7883

ξ = 100:
             θ = 0     θ = π/·   θ = π/·   θ = π/·
  ω-Jacobi   0.9886    0.9913    0.9934    0.9942
  α-CNN      0.7660    0.7743    0.9651    0.9728

Table 2: Ideal convergence bound (7) corresponding to the same methods and problems as in Table 1.

The results in Tables 1 and 2 show that for each rotated Laplacian problem, the convergence measures associated with the α-CNN smoothers are much smaller than those associated with the ω-Jacobi smoothers, which indicates that faster convergence can be achieved by multigrid solvers equipped with α-CNN smoothers.
Smoothing property. To show that our proposed method can learn the smoother with the best smoothing property among the compared methods, for each eigenvector $v$ (normalized to unit 2-norm) of the fine-level operator $A$ associated with parameters θ = π/·, ξ = 100, N = 16 on a square domain, we compute its convergence factor $\|v - H^{(0)}(Av)\|$, where $H^{(0)}$ is the smoother on the finest level. An efficient smoother should lead to small convergence factors for eigenvectors associated with large eigenvalues. The results are shown in Figure 4, where the eigenmodes are listed in descending order of the corresponding eigenvalues. The CNN smoother can reduce low-frequency errors more rapidly than ω-Jacobi, but both of them have comparable performance for damping high-frequency errors. In contrast, α-CNN has the best performance and exhibits a superior smoothing property, as the convergence factors of the eigenvectors associated with large eigenvalues are about 6 times smaller than those associated with the other two smoothers.

Figure 4: Convergence factors of the ω-Jacobi, CNN and α-CNN smoothers with respect to the eigenvectors of $A$ for the rotated Laplacian on a 16 × 16 grid, where θ = π/· and ξ = 100. The eigenvectors are sorted in descending order of the corresponding eigenvalues.
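The per-eigenvector convergence factors in Figure 4 are straightforward to compute once a smoother is available as a function; a sketch of ours:

```python
# Sketch: smoothing factors ||v - H(Av)|| per eigenvector (cf. Figure 4).
import numpy as np

def convergence_factors(A, H):
    """A: SPD matrix; H: callable applying the smoother to a residual vector."""
    lam, V = np.linalg.eigh(A)
    order = np.argsort(lam)[::-1]           # descending eigenvalues
    return np.array([np.linalg.norm(V[:, i] - H(A @ V[:, i])) for i in order])

# Example with weighted Jacobi, H(r) = omega * D^{-1} r:
n = 64
A = 2*np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)
factors = convergence_factors(A, lambda r: (2/3) * r / np.diag(A))
print(factors[:5])                           # high-frequency modes come first
```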
Generalization property. To illustrate that our proposed method is practically useful, besides the spectral statistics we also present the actual iteration counts and run times for the multigrid solvers to converge. Moreover, for a given PDE problem we want to train the neural smoothers only once; that is, the neural smoothers should not need to be retrained when we increase the grid size or change the geometry of the problem. In this experiment, we first show that the trained smoothers generalize to different grid sizes without retraining. We fix the parameters of the problems to ξ = 100 and θ = π/· on a square domain. We show in Figure 5 that for problems of size 1023², multigrid methods using α-CNN smoothers converge in fewer iterations than multigrid methods using the CNN and ω-Jacobi smoothers. Although the cost per iteration of the α-CNN smoothers is higher than that of ω-Jacobi, the iteration time of the multigrid methods using α-CNN is still faster than that using CNN and ω-Jacobi by factors of 1.68 and 2.1, respectively.

Figure 5: Numbers of iterations and run time required by multigrid with full coarsening to reach the convergence tolerance 10⁻⁶ for solving the rotated Laplacian problem of grid sizes 63², 127², 255², 511² and 1023², with parameters ξ = 100 and θ = π/· on square domains.

Since the CNN smoothers were trained independently, they are not as successful as α-CNN at capturing the smoothing property of reducing errors that cannot be reduced by the lower levels of multigrid. Hence, we only compare the α-CNN and ω-Jacobi smoothers in the rest of the paper. Next, we fix the parameters of the problems to θ = π/·, ξ = 100 and show that the trained α-CNN smoothers can be generalized to problems with two different geometries (shown in Figure 6) without retraining.

Figure 6: Ground truth solutions on the square domain, cylinder domain and L-shaped domain.

The results for the two different domains are shown in Figure 7. Since we use convolutional layers to approximate the inverse of the smoothers, α-CNN uses neighborhood information to smooth the error at each grid point; therefore, without retraining, the smoother trained on the square domain still makes the multigrid method converge about 4 times faster than ω-Jacobi on the cylinder domain. On the L-shaped domain, for the same problem size, the performance improvement is also about a factor of 4. For problems with other anisotropy angles (Figure 8), the multigrid method using α-CNN smoothers with full coarsening converges substantially faster than the one with ω-Jacobi, and with red-black coarsening it still requires considerably fewer iterations.
Figure 7: Numbers of iterations and run time required by multigrid solvers for solving the rotated Laplacian problems with parameters θ = π/· and ξ = 100 on the cylinder domain (top two figures) and the L-shaped domain (bottom two figures).

Figure 8: Numbers of iterations and run time required by multigrid solvers for solving the rotated Laplacian problems of size n = 511 with five values of θ and ξ = 100. The top two figures show the performance of multigrid with full coarsening and the bottom two figures show the performance of multigrid with red-black coarsening.

6 Conclusion
In this work we proposed an efficient framework for training smoothers in the form of multi-layered CNNs that can be used in multigrid methods for solving linear systems arising from PDE problems. The training process of the proposed smoothing algorithm, called α-CNN, is guided by multigrid convergence theories and has the desired property of minimizing errors that cannot be efficiently annihilated by coarse-grid corrections. Experiments on rotated Laplacian problems show the superior smoothing property of α-CNN smoothers, which leads to better multigrid convergence when combined with standard coarsening and interpolation schemes compared with classical smoothers. We also showed that α-CNN smoothers that are well trained on small problems can be generalized to problems of much larger sizes and different geometries without retraining.

Disclaimer
This document was prepared as an account of work sponsored by an agency of the United States government. Neither the United States government nor Lawrence Livermore National Security, LLC, nor any of their employees makes any warranty, expressed or implied, or assumes any legal liability or responsibility for the accuracy, completeness, or usefulness of any information, apparatus, product, or process disclosed, or represents that its use would not infringe privately owned rights. Reference herein to any specific commercial product, process, or service by trade name, trademark, manufacturer, or otherwise does not necessarily constitute or imply its endorsement, recommendation, or favoring by the United States government or Lawrence Livermore National Security, LLC. The views and opinions of authors expressed herein do not necessarily state or reflect those of the United States government or Lawrence Livermore National Security, LLC, and shall not be used for advertising or product endorsement purposes.
A Proof of equivalent conditions for convergent smoothers
The following result gives a necessary and sufficient condition for having a convergent smoother $M$ in the energy norm.
Theorem A.1. Assume $A$ is SPD. Each step of iteration (3) is convergent in the energy norm, i.e., $\|e_{k+1}\|_A = \|Ge_k\|_A < \|e_k\|_A$ for any $e_k \neq 0$, if and only if $M + M^T - A$ is SPD.

Proof. We have the following identity:
$$\|e_{k+1}\|_A^2 = \|(I - M^{-1}A)e_k\|_A^2 = \|e_k\|_A^2 - 2(e_k, M^{-1}Ae_k)_A + \|M^{-1}Ae_k\|_A^2 = \|e_k\|_A^2 - \left((2M - A)M^{-1}Ae_k, M^{-1}Ae_k\right),$$
so $\|e_{k+1}\|_A < \|e_k\|_A$ for any $e_k \neq 0$ if and only if $2M - A$ is positive definite (PD), or equivalently $M + M^T - A$ is SPD.
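A quick numerical sanity check of Theorem A.1 (our sketch): for weighted Jacobi on the 1-D Poisson matrix, $M + M^T - A$ is SPD exactly when $\|G\|_A = \|A^{1/2} G A^{-1/2}\|_2 < 1$:

```python
# Sketch: check Theorem A.1 numerically for weighted Jacobi, M = D/omega.
import numpy as np
from scipy.linalg import sqrtm, inv, norm, eigvalsh

n = 31
A = 2*np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)
A_half = sqrtm(A)

for omega in (0.5, 1.0, 1.5):                # omega = 1.5 breaks convergence here
    M = np.diag(np.diag(A)) / omega
    G = np.eye(n) - inv(M) @ A
    gA = norm(A_half @ G @ inv(A_half), 2)   # ||G||_A
    spd = eigvalsh(M + M.T - A).min() > 0
    print(omega, round(gA, 4), spd)          # ||G||_A < 1 iff M + M^T - A is SPD
```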
A similar result can be shown in the 2-norm.

Theorem A.2. Assume $A$ is SPD. Each step of iteration (3) is convergent in the 2-norm, i.e., $\|e_{k+1}\| = \|Ge_k\| < \|e_k\|$ for any $e_k \neq 0$, if and only if $A^{-1}M + M^T A^{-1} - I$ is SPD.

Proof. We have the following identity:
$$\|e_{k+1}\|^2 = \|e_k\|^2 - 2(e_k, M^{-1}Ae_k) + \|M^{-1}Ae_k\|^2 = \|e_k\|^2 - \left((2A^{-1}M - I)M^{-1}Ae_k, M^{-1}Ae_k\right).$$
So $\|e_{k+1}\| < \|e_k\|$ if and only if $2A^{-1}M - I$ is PD, or equivalently $A^{-1}M + M^T A^{-1} - I$ is SPD.

Moreover, we can show that if $A^{-1}M + M^T A^{-1} - I$ is SPD, then the numerical radius $\nu(G) < 1$, for which we first state the following lemma.
Lemma A.1. Suppose $C$ is SPD and $(Bx, x) > (\alpha Cx, x)$ for any $x \neq 0$ and some $\alpha > 0$. Then $B^{-1} + B^{-T}$ is SPD and
$$0 < (\alpha B^{-1}x, x) < (C^{-1}x, x). \qquad (14)$$

Proof. Since $C$ is SPD and $(Bx, x) > (\alpha Cx, x) > 0$ for $x \neq 0$, $B$ is positive definite and so is $B^{-1}$, i.e., $B^{-1} + B^{-T}$ is SPD. For (14), we have $(\alpha B^{-1}x, x) = (\alpha C^{1/2}B^{-1}x, C^{-1/2}x)$, from which and the Cauchy-Schwarz inequality it follows that
$$(\alpha B^{-1}x, x)^2 \le \alpha^2 \|C^{1/2}B^{-1}x\|^2 \|C^{-1/2}x\|^2 = (\alpha CB^{-1}x, \alpha B^{-1}x)(C^{-1}x, x) < (x, \alpha B^{-1}x)(C^{-1}x, x).$$
The result follows by dividing both sides by $(\alpha B^{-1}x, x)$.
Theorem A.3. Assume $A^{-1}M + M^T A^{-1} - I$ is SPD. Then the numerical radius $\nu(G) < 1$.

Proof. By the assumption, $(A^{-1}Mx, x) > (x/2, x)$, so by Lemma A.1 (applied with $B = A^{-1}M$, $C = I$ and $\alpha = 1/2$), we have $0 < (M^{-1}Ax, x) < 2(x, x)$, i.e., the numerical radius $\nu(G) < 1$.
Corollary A.1. Assume $A^{-1}M + M^T A^{-1} - I$ is SPD. Then the spectral radius $\rho(G) < 1$, since $\rho(G) \le \nu(G)$.

B Proof of convolutional network as convergent smoother
Lemma B.1. Norms are continuous functions on $\mathbb{R}^n$.

Proof. Suppose $x$ and $y$ are two vectors in $\mathbb{R}^n$. Define $u = x - y$ and write $u$ in the canonical basis as $u = \sum_{i=1}^n \delta_i e_i$. Then we have
$$\|u\| = \Big\|\sum_{i=1}^n \delta_i e_i\Big\| \le \sum_{i=1}^n |\delta_i| \|e_i\| \le \max_i |\delta_i| \sum_{i=1}^n \|e_i\|.$$
Setting $M = \sum_{i=1}^n \|e_i\|$, we get $\|u\| \le M \max_i |\delta_i| = M \|x - y\|_\infty$. Let $\epsilon > 0$ be given and take $x, y$ such that $\|x - y\|_\infty \le \epsilon/M$. Then, by using the triangle inequality, we obtain
$$\big|\|x\| - \|y\|\big| \le \|x - y\| \le M \max_i |\delta_i| \le M \frac{\epsilon}{M} = \epsilon.$$
This means that we can make $\|y\|$ arbitrarily close to $\|x\|$ by making $y$ close enough to $x$ in the sense of the defined metric. Therefore, norms are continuous functions.

Lemma B.2. On $\mathbb{R}^n$ all norms are equivalent.

Proof. We only need to show the equivalence between an arbitrary norm $\|\cdot\|$ and $\|\cdot\|_\infty$; all other cases follow from this one. First we show that for some scalar $\alpha$ we have $\|x\| \le \alpha \|x\|_\infty$. Express $x$ in the canonical basis of $\mathbb{R}^n$ as $x = \sum_{i=1}^n x_i e_i$. Then
$$\|x\| = \Big\|\sum_{i=1}^n x_i e_i\Big\| \le \sum_{i=1}^n |x_i| \|e_i\| \le \max_i |x_i| \sum_{i=1}^n \|e_i\| \le \alpha \|x\|_\infty,$$
where $\alpha = \sum_{i=1}^n \|e_i\|$. We then show that there is a $\beta > 0$ such that $\|x\| \ge \beta \|x\|_\infty$. Assume $x \neq 0$ and consider $u = x/\|x\|_\infty$. Note that $u$ has infinity norm equal to one; therefore it belongs to the closed and bounded set $S_\infty = \{v \mid \|v\|_\infty = 1\}$. Based on Lemma B.1, the minimum of the norm $\|u\|$ over this set is attained, i.e., there is a $u_0 \in S_\infty$ such that $\min_{u \in S_\infty} \|u\| = \|u_0\|$. Let us call $\beta$ this minimum value. Since this value cannot be zero, we then have
$$\Big\|\frac{x}{\|x\|_\infty}\Big\| \ge \beta,$$
which implies $\|x\| \ge \beta \|x\|_\infty$. This completes the proof.

Theorem B.1.
Suppose $H$ is a convolutional network with $k$ convolutional layers. For a fixed matrix $A$, if parameterized properly, $H$ can be a convergent smoother.

Proof. Based on the universality property of deep convolutional neural networks without fully connected layers [Zhou, 2020], we know that $H$ can approximate the linear operator $A^{-1}$ to an arbitrary accuracy measured in some norm when $k$ is large enough. Thus, theoretically, $HA$ can be made arbitrarily close to the identity mapping if parameterized properly. Based on Lemmas B.1 and B.2, matrix induced norms are continuous functions, and thus $\|I - HA\|_A$, measured in the matrix $A$-norm, can be made less than 1 for certain $k$.

C Finite difference discretization of PDEs
We discretize the operators $\Delta u$ and $u_{xy}$ in (10) using the following stencils:
$$\frac{1}{h^2}\begin{pmatrix} 0 & -1 & 0 \\ -1 & 4 & -1 \\ 0 & -1 & 0 \end{pmatrix} \quad \text{and} \quad \frac{1}{2h^2}\begin{pmatrix} 0 & -1 & 1 \\ -1 & 2 & -1 \\ 1 & -1 & 0 \end{pmatrix},$$
where $h$ is the grid spacing.
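For reference, a small sketch of ours assembling the discrete operator of (10)-(11) for constant $T$ by combining such stencils (sign conventions and stencil orientation are our assumptions here) and applying it by convolution:

```python
# Sketch (illustrative, sign conventions ours): assemble the 3x3 stencil of the
# discrete rotated Laplacian (10)-(11) for constant T, and apply it by convolution.
import numpy as np
from scipy.signal import convolve2d

def rotated_laplacian_stencil(theta, xi, h=1.0):
    c, s = np.cos(theta), np.sin(theta)
    t11 = c*c + xi*s*s                   # T[0,0]
    t22 = s*s + xi*c*c                   # T[1,1]
    t12 = (1.0 - xi)*c*s                 # T[0,1] = T[1,0]
    # second differences approximating -u_xx, -u_yy and -u_xy:
    sxx = np.array([[0,  0,  0], [-1, 2, -1], [0,  0, 0]]) / h**2
    syy = sxx.T
    sxy = np.array([[0,  1, -1], [1, -2,  1], [-1, 1, 0]]) / (2*h**2)
    # -div(T grad u) = -(t11 u_xx + 2 t12 u_xy + t22 u_yy) for constant T
    return t11*sxx + t22*syy + 2*t12*sxy

stencil = rotated_laplacian_stencil(theta=np.pi/8, xi=100.0)
u = np.random.rand(16, 16)
print(convolve2d(u, stencil, mode="valid").shape)   # (14, 14): A on interior points
```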
D Numerical results of spectral radius

We fix the parameters of the rotated Laplacian problems to θ = 0 and train the neural smoothers for ξ = 100, 200, 300, 400 on the square domain. Then we fix ξ = 100 and train the neural smoothers for the four values of θ listed in Table 3. We compare the spectral radius of the iteration matrices of the 5-level multigrid solvers equipped with the different smoothers and summarize the results in Table 3.

θ = 0:
             ξ = 100   ξ = 200   ξ = 300   ξ = 400
  ω-Jacobi   0.9853    0.9918    0.9940    0.9951
  α-CNN      0.6816    0.8189    0.8805    0.8936

ξ = 100:
             θ = 0     θ = π/·   θ = π/·   θ = π/·
  ω-Jacobi   0.9853    0.9436    0.8981    0.8837
  α-CNN      0.6816    0.4534    0.4547    0.4216

Table 3: Spectral radius of the iteration matrices corresponding to the 5-level multigrid methods with full coarsening and ω-Jacobi and 6-layered α-CNN smoothers for rotated Laplacian problems with different θ and ξ. The mesh size is 16 × 16.

E More numerical results
Figure 9 shows the performance of the 5-level multigrid with ω-Jacobi smoothers and α-CNN smoothers using full coarsening and red-black coarsening, with the same problem setting as in Figure 8. Since we use 6 convolutional layers with full coarsening and 2 convolutional layers with red-black coarsening, for a fair comparison in terms of computational flops per iteration, in this experiment we run 6 Jacobi steps per iteration for full coarsening and 2 Jacobi steps for red-black coarsening.

Figure 9: Numbers of iterations and run time required by multigrid solvers for solving the rotated Laplacian problems of size n = 511 with five values of θ and ξ = 100. The top two figures show the performance of multigrid with full coarsening and the bottom two figures show the performance of multigrid with red-black coarsening.

References
Allison H. Baker, Robert D. Falgout, Tzanio V. Kolev, and Ulrike Meier Yang. Multigrid smoothers for ultraparallel computing. SIAM Journal on Scientific Computing, 33(5):2864–2887, 2011. doi:10.1137/100798806.

Randolph E. Bank and Craig C. Douglas. Sharp estimates for multigrid rates of convergence with general smoothing and acceleration. SIAM Journal on Numerical Analysis, 22(4):617–633, 1985. doi:10.1137/0722038.

Jens Berg and Kaj Nyström. A unified deep artificial neural network approach to partial differential equations in complex geometries. Neurocomputing, 317:28–41, 2018.

Matthias Bolten and Karsten Kahl. Using block smoothers in multigrid methods. PAMM, 12(1):645–646, 2012. doi:10.1002/pamm.201210311.

A. Brandt, S. McCormick, and J. Ruge. Algebraic multigrid (AMG) for sparse matrix equations. In David J. Evans, editor, Sparsity and its Applications, pages 257–284. Cambridge University Press, Cambridge, 1985. ISBN 0521262720.

Achi Brandt. Algebraic multigrid (AMG) for sparse matrix equations. Sparsity and its Applications, pages 257–284, 1984.

Achi Brandt. Algebraic multigrid theory: The symmetric case. Applied Mathematics and Computation, 19(1):23–56, 1986. doi:10.1016/0096-3003(86)90095-0.

M. Brezina, A. J. Cleary, R. D. Falgout, V. E. Henson, J. E. Jones, T. A. Manteuffel, S. F. McCormick, and J. W. Ruge. Algebraic multigrid based on element interpolation (AMGe). SIAM Journal on Scientific Computing, 22(5):1570–1592, 2001. doi:10.1137/S1064827598344303.

William L. Briggs, Van Emden Henson, and Steve F. McCormick. A Multigrid Tutorial. SIAM, 2000.

Bo Chang, Lili Meng, Eldad Haber, Frederick Tung, and David Begert. Multi-level residual networks from dynamical systems view. arXiv preprint arXiv:1710.10348, 2017.

D. J. Evans and W. S. Yousif. The explicit block relaxation method as a grid smoother in the multigrid V-cycle scheme. International Journal of Computer Mathematics, 34(1-2):71–78, 1990. doi:10.1080/00207169008803864.

Robert D. Falgout and Panayot S. Vassilevski. On generalizing the algebraic multigrid framework. SIAM Journal on Numerical Analysis, 42(4):1669–1693, 2004. doi:10.1137/S0036142903429742.

Daniel Greenfeld, Meirav Galun, Ronen Basri, Irad Yavneh, and Ron Kimmel. Learning to optimize multigrid PDE solvers. In International Conference on Machine Learning, pages 2415–2423. PMLR, 2019.

T. Gudmundsson, C. S. Kenney, and A. J. Laub. Small-sample statistical estimates for matrix norms. SIAM Journal on Matrix Analysis and Applications, 16(3):776–792, 1995. doi:10.1137/S0895479893243876.

Eldad Haber, Lars Ruthotto, Elliot Holtham, and Seong-Hwan Jun. Learning across scales—multiscale methods for convolution neural networks. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 32, 2018.

Jiequn Han, Arnulf Jentzen, and E Weinan. Solving high-dimensional partial differential equations using deep learning. Proceedings of the National Academy of Sciences, 115(34):8505–8510, 2018.

Juncai He and Jinchao Xu. MgNet: A unified framework of multigrid and convolutional neural network. Science China Mathematics, 62(7):1331–1354, 2019.

Philipp Holl, Vladlen Koltun, and Nils Thuerey. Learning to control PDEs with differentiable physics. arXiv preprint arXiv:2001.07457, 2020.

Jun-Ting Hsieh, Shengjia Zhao, Stephan Eismann, Lucia Mirabella, and Stefano Ermon. Learning neural PDE solvers with convergence guarantees. In International Conference on Learning Representations, 2019.

Alexandr Katrutsa, Talgat Daulbaev, and Ivan Oseledets. Deep multigrid: learning prolongation and restriction matrices. arXiv preprint arXiv:1711.03825, 2017.

Isaac E. Lagaris, Aristidis Likas, and Dimitrios I. Fotiadis. Artificial neural networks for solving ordinary and partial differential equations. IEEE Transactions on Neural Networks, 9(5):987–1000, 1998.

Paul T. Lin, John N. Shadid, and Paul H. Tsuji. Krylov smoothing for fully-coupled AMG preconditioners for VMS resistive MHD, pages 277–286. Springer International Publishing, Cham, 2020. ISBN 978-3-030-30705-9. doi:10.1007/978-3-030-30705-9_24.

Ilay Luz, Meirav Galun, Haggai Maron, Ronen Basri, and Irad Yavneh. Learning algebraic multigrid using graph neural networks. arXiv preprint arXiv:2003.05744, 2020.

Siddhartha Mishra. A machine learning framework for data driven acceleration of computations of differential equations. arXiv preprint arXiv:1807.09519, 2018.

O. Ronneberger, P. Fischer, and T. Brox. U-Net: Convolutional networks for biomedical image segmentation. In Medical Image Computing and Computer-Assisted Intervention (MICCAI), volume 9351 of LNCS, pages 234–241. Springer, 2015. URL http://lmb.informatik.uni-freiburg.de/Publications/2015/RFB15a. Available on arXiv:1505.04597 [cs.CV].

John W. Ruge. Algebraic multigrid (AMG) for geodetic survey problems. In Preliminary Proc. Internat. Multigrid Conference, Fort Collins, CO, 1983.

Yousef Saad. Iterative Methods for Sparse Linear Systems. Society for Industrial and Applied Mathematics, second edition, 2003. doi:10.1137/1.9780898718003.

Jonas Schmitt, Sebastian Kuckuk, and Harald Köstler. Optimizing geometric multigrid methods with evolutionary computation. arXiv preprint arXiv:1910.02749, 2019.

Justin Sirignano and Konstantinos Spiliopoulos. DGM: A deep learning algorithm for solving partial differential equations. Journal of Computational Physics, 375:1339–1364, 2018.

Mingui Sun, Xiaopu Yan, and R. J. Sclabassi. Solving partial differential equations in real-time using artificial neural network signal processing as an alternative to finite-element analysis. In International Conference on Neural Networks and Signal Processing, 2003. Proceedings of the 2003, volume 1, pages 381–384. IEEE, 2003.

Wei Tang, Tao Shan, Xunwang Dang, Maokun Li, Fan Yang, Shenheng Xu, and Ji Wu. Study on a Poisson's equation solver based on deep learning technique. Pages 1–3. IEEE, 2017.

Ulrich Trottenberg, Cornelius W. Oosterlee, and Anton Schuller. Multigrid. Elsevier, 2000.

Shiyin Wei, Xiaowei Jin, and Hui Li. General solutions for nonlinear differential equations: a rule-based self-learning approach using deep reinforcement learning. Computational Mechanics, 64(5):1361–1374, 2019.

Gabriel Wittum. On the robustness of ILU smoothing. SIAM Journal on Scientific and Statistical Computing, 10(4):699–717, 1989. doi:10.1137/0910043.

Jinchao Xu and Ludmil Zikatanov. Algebraic multigrid methods. Acta Numerica, 26:591–721, 2017. doi:10.1017/S0962492917000083.

Ding-Xuan Zhou. Universality of deep convolutional neural networks. Applied and Computational Harmonic Analysis, 48(2):787–794, 2020. doi:10.1016/j.acha.2019.06.004.