The Divide-and-Conquer Framework: A Suitable Setting for the DDM of the Future

Abstract

This paper was prompted by numerical experiments we performed, in which algorithms already available in the literature (DVS-BDDC) yielded accelerations (or speedups) many times larger than the number of processors used (more than seventy times larger in some examples already treated, and probably often much larger). Based on these outstanding results, it is shown here that belief in the standard ideal speedup, which is taken to be equal to the number of processors, has greatly limited the performance goal sought by research on domain decomposition methods (DDM) and has thus far greatly hindered its development. Hence, an improved theory, in which the speedup goal is based on the Divide and Conquer algorithmic paradigm, frequently considered the leitmotiv of domain decomposition methods, is proposed as a suitable setting for the DDM of the future.



Ismael Herrera-Revilla*, Iván Contreras and Graciela S. Herrera

Instituto de Geofísica, Universidad Nacional Autónoma de México (UNAM), Mexico City, Mexico *Corresponding author: [email protected]

Keywords: DDM, DVS-DDM, parallel computation, ideal parallel speedup, divide and conquer

1. Introduction

The present paper was prompted by some outstanding results that we recently obtained in a sequence of numerical and computational experiments applying parallel algorithms already available in the literature (the DVS-BDDC [1-8]). They are outstanding because they contradict the generally accepted belief that in parallel computation the acceleration, or speedup, cannot be greater than the number of processors [9-22]. For example, in our numerical experiments using 400 processors in parallel we achieved a speedup of 29,278, which is 73.2 times greater than the maximum acceleration that such a belief permits. In agreement with such a belief, the speedup goal sought in most, probably all, research carried out on domain decomposition methods (DDM) up to now [9-20] is equal to the number of processors used. Since our results show that considerably larger speedups are feasible, we conclude that the speedup goal sought so far is too modest and restrictive; hence, it should be replaced by a larger and more ambitious performance goal in future DDM research. To this end, we resort to the divide and conquer strategy, which is probably the most basic algorithmic paradigm for the parallel processing of partial differential equations [23]. Furthermore, we formulate it in a manner that yields precise and clearly defined quantitative performance goals, to be called DC-goals, which are larger, yet realistic. The adequacy of the modified framework so obtained is verified by satisfactorily incorporating the outstanding results just mentioned into it.

The paper is organized as follows. Section 2 presents some background material on the derived-vector space (DVS) approach to DDM and on DVS-BDDC [1-8]. The outstanding performance results that prompted this article are introduced and explained in Section 3. An inconsistency of standard approaches to DDM that such results exhibit is pointed out and discussed in Section 4, while Section 5 introduces some measures of performance whose conspicuous feature is that they are defined with respect to a performance goal. The ideas and results contained in Sections 3 to 5 are then used in Sections 6 and 7 to show that both the concept of ideal parallel performance and the belief that the ideal parallel speedup is $p$ lack firm bases. The “divide and conquer” algorithmic paradigm ([23], p. v), the DC-paradigm, is recalled and revised in Section 9, and a quantitative DC-performance goal adequate for future DDM research is derived from it. There, it is also shown that in the examples discussed here the latter performance goal is larger than $p$ by a big factor; indeed, in these examples the DC-speedup goal is close to $p^2$ and the factor we are referring to is close to $p^2/p = p$, which is large when the number of processors is large. When the outstanding numerical and computational results that prompted this paper are incorporated in the DC-framework they look perfectly normal, as shown in Section 10, since their DC-efficiencies, for $p > 1$, range from [...] to [...]. Sections 11 and 12 are devoted to exhibiting the severe restrictions that believing in the relation $S(p,n) \le p$ has imposed on software developed under that assumption. Finally, Section 13 states the paper’s conclusions.

2. Some background

I. Herrera and some of his coworkers have been working on domain decomposition methods (DDM) since 2002, when he organized and hosted the Fourteenth International Conference on Domain Decomposition Methods [23]. In works preceding the present paper [1-8], they indicated that the use of coarse meshes in which some of the nodes are shared by several subdomains is a serious handicap, because it goes against the ‘divide and conquer paradigm’ and the system matrix so obtained is not block-diagonal. To overcome this inconvenience, they introduced the derived-nodes, the derived-vectors and the derived-vector space (DVS), which altogether yielded the derived-vector space framework (DVS-framework). The main advantage of DVS formulations is that the system matrix is strictly block-diagonal, while those of standard approaches are not [2].

In the DVS algorithmic framework, the procedure is as follows. First, the partial differential equation is discretized in a non-overlapping fashion; the concept of non-overlapping discretization is given in reference [2]. The most significant and conspicuous property of such discretizations is that they yield block-diagonal systems of equations directly from the discretization process, and such systems of equations are defined on non-overlapping systems of nodes. A non-overlapping system of nodes is one in which each node belongs to one and only one subdomain of the coarse mesh (Fig. 1).

There are four DVS-algorithms: DVS-FETI-DP, DVS-BDDC, DVS-PRIMAL and DVS-DUAL. The first two were obtained by mimicking the well-known FETI-DP and BDDC procedures in the derived-vector space (see [2] for further details); the big and very significant difference is that such procedures are applied after the differential equations have been subjected to a non-overlapping discretization, so that the discrete system of linear equations we start with is block-diagonal. The other two DVS-algorithms, DVS-PRIMAL and DVS-DUAL, were produced by completion of the theoretical framework (again, see [2]). So far, only the DVS-BDDC algorithm has been numerically tested; in 2016, preliminary computational experiments were published, which showed that DVS-BDDC was fully competitive with the top DDM algorithms then available [1]. However, at that time we had not yet produced the very outstanding results we are now reporting.

3. The outstanding results

More recently, in 2018, the authors developed a more careful code for the DVS-BDDC algorithm and tested it through a set of numerical experiments, obtaining the outstanding results that are presented and discussed in this Section. They are objectively outstanding because, for example, when the number of processors used is 400 the acceleration produced is 73.2 multiplied by 400; hence, in this application, the DVS-BDDC algorithm produces an acceleration 73.2 times larger than the largest possible according to standard theory. More specifically, the computational experiments reported here consisted in treating a well-posed 2D problem for the Laplace differential operator on the highly parallelized supercomputer “Miztli” of the National Autonomous University of Mexico (UNAM), using successively 1, 16, 25, 64, 256 and 400 processors. The notation used to report the numerical and computational results so obtained is given next:

$$
\begin{array}{ll}
\text{Number of processors} & p \\
\text{Size of the problem} & n \\
\text{Execution time} & T(p,n) \\
\text{Speedup} & S(p,n)
\end{array}
\tag{3.1}
$$

Here, the size of the problem is equal to the number of degrees of freedom, which in turn is equal to the number of nodes of the fine mesh. In general, the “execution time” and “speedup” are functions of the pair $(p,n)$. In the set of experiments reported here the size of the problem is kept fixed and equal to [...]; i.e., $n = $ [...] throughout the set of numerical experiments. The very impressive results of the numerical experiments are given in Table 1 (everywhere in this paper times are given in seconds), where the fifth column gives the speedup as a multiple of $p$, the number of processors, which in the standard theory of domain decomposition methods is thought to be an insurmountable speedup.

| $p$ | $n$ | $T(p,n)$ | $S(p,n)$ | $S(p,n)$ in terms of $p$ | $p/S(p,n)$ |
|---|---|---|---|---|---|
| 1 | $10^{[\ldots]}$ | 29,278 | 1 | $1.00\,p$ | 1 |
| 16 | $10^{[\ldots]}$ | 178 | 164.5 | $10.28\,p$ | .097 |
| 25 | $10^{[\ldots]}$ | 78 | 375.4 | $15.02\,p$ | .067 |
| 64 | $10^{[\ldots]}$ | 16 | 1,829 | $28.58\,p$ | .035 |
| 256 | $10^{[\ldots]}$ | 2 | 14,639 | $57.18\,p$ | .017 |
| 400 | $10^{[\ldots]}$ | 1 | 29,278 | $73.20\,p$ | .014 |

Table 1. Results of computational experiments

However, in the set of experiments we are reporting, the speedup is much greater than the standard theory foresees if $p > 1$; moreover, it exceeds such an upper bound by a large factor: 10.28, 15.02, 28.58, 57.18 and 73.2 when the number of processors is 16, 25, 64, 256 and 400, respectively. Observe that the factor increases with the number of processors, which is an encouraging feature. The last column is included here only for later use.
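The speedup figures of Table 1 can be cross-checked from the reported execution times. What follows is a minimal Python sketch, not part of the DVS-BDDC code: the names `TIMES`, `speedup` and `speedup_per_processor` are ours, and the single-processor time of 29,278 s is assumed from the $p = 1$ row of Table 4. Because the printed times are rounded to whole seconds, the recomputed speedups agree with the table only approximately.

```python
# Execution times T(p, n) in seconds, as printed (rounded) in the paper.
# The dictionary and helper names are illustrative assumptions.
TIMES = {1: 29278.0, 16: 178.0, 25: 78.0, 64: 16.0, 256: 2.0, 400: 1.0}


def speedup(p, times=TIMES):
    """Speedup S(p, n) = T(1, n) / T(p, n)."""
    return times[1] / times[p]


def speedup_per_processor(p, times=TIMES):
    """Speedup expressed as a multiple of p, i.e. S(p, n) / p."""
    return speedup(p, times) / p


if __name__ == "__main__":
    for p in sorted(TIMES):
        print(f"p={p:4d}  S={speedup(p):10.1f}  S/p={speedup_per_processor(p):6.2f}")
```

Under the standard assumption $S(p,n) \le p$, the ratio returned by `speedup_per_processor` could never exceed 1; with the reported times it grows with $p$.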

4. An inconsistency of standard DDM

The standard definition of efficiency is

$$E_S(p,n) \equiv \frac{S(p,n)}{p} \tag{4.1}$$

and it is usually expressed as a percentage. The subscript $S$ used here comes from “standard”; it is used for clarity, since alternative definitions will be introduced later. Table 2, which follows, has been derived from Table 1 by expressing the speedup in terms of the standard efficiency, $E_S(p,n)$.

| $p$ | $n$ | $T(p,n)$ | $S(p,n)$ | $E_S(p,n)$ in percentage |
|---|---|---|---|---|
| 1 | $10^{[\ldots]}$ | 29,278 | 1 | 100% |
| 16 | $10^{[\ldots]}$ | 178 | 164.5 | 1,028% |
| 25 | $10^{[\ldots]}$ | 78 | 375.4 | 1,502% |
| 64 | $10^{[\ldots]}$ | 16 | 1,829 | 2,858% |
| 256 | $10^{[\ldots]}$ | 2 | 14,639 | 5,718% |
| 400 | $10^{[\ldots]}$ | 1 | 29,278 | 7,320% |

Table 2. Using standard efficiency for expressing outstanding results

By inspection of Table 2, in which percentages far beyond 100% occur (1,028%, 1,502%, 2,858%, 5,718% and 7,320%), it is seen that the standard efficiency is not adequate for expressing the outstanding results of the numerical experiments we are reporting.

5. Revisiting the measures of performance

In this Section we define some measures of parallel-software performance that will be used in the sequel. As usual, such measures are based on the execution time required for completing a task; the shorter the better. According to Eq. (3.1), the notation $T(p,n)$ means the execution time when the number of processors is $p$; in particular, $T(1,n)$ is the execution time when only one processor is applied. For the sake of clarity, we recall the definition of speedup (or acceleration):

$$S(p,n) \equiv \frac{T(1,n)}{T(p,n)} \tag{5.1}$$

The main objective in using a parallel computer is to get a simulation to finish faster than it would on one processor. Furthermore, let us take the position of a software designer who intends to construct software that performs well and who therefore defines a performance goal to be achieved. The following two procedures for specifying such a goal will be considered: fixing the execution-time goal, $T_G(p,n)$, or fixing the speedup goal, $S_G(p,n)$. Assume that one of them has been specified; then the relative efficiency (relative to a goal performance) is defined by

$$E_G(p,n) \equiv \frac{S(p,n)}{S_G(p,n)} \tag{5.2}$$

when $S_G(p,n)$ is given, or

$$E_G(p,n) \equiv \frac{T_G(p,n)}{T(p,n)} \tag{5.3}$$

when $T_G(p,n)$ is given. These two manners of defining relative efficiency are equivalent if and only if

$$T_G(p,n)\, S_G(p,n) = T(1,n) \tag{5.4}$$

Hence

$$S_G(p,n) = \frac{T(1,n)}{T_G(p,n)} \quad \text{and} \quad T_G(p,n) = \frac{T(1,n)}{S_G(p,n)} \tag{5.5}$$

The first of these equalities can be used to obtain $S_G(p,n)$ when $T_G(p,n)$ is given, and the second one, conversely. According to Eq. (5.2),

$$E_G(p,n) \le 1 \iff S(p,n) \le S_G(p,n) \tag{5.6}$$

Here the symbol $\iff$ stands for logical equivalence; i.e., if and only if. Actually, when we choose a goal we do not know whether it is achievable, but the initial state satisfies $S(p,n) < S_G(p,n)$, since $S_G(p,n)$ is a desirable state. Hence, at the beginning $E_G(p,n) < 1$, and this quantity may be taken as a measure of the distance to the goal. However, it can also happen that our developments lead to a speedup $S(p,n) > S_G(p,n)$, since generally we do not know beforehand whether the speedup $S_G(p,n)$ is an upper bound of those possible. When that happens, $E_G(p,n) > 1$. A corresponding argument can be made if the execution time and Eq. (5.3) are used to define the parallel efficiency. The main difference is that, in such a case, $T_G(p,n) < T(p,n)$ at the beginning and $T_G(p,n) > T(p,n)$ is an indication that the goal has been exceeded.
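The equivalence of the two definitions of relative efficiency can be illustrated numerically. The Python sketch below is only an illustration under stated assumptions: the function names are ours, and the numbers (a single-processor time of 1000 s, a parallel time of 50 s and a speedup goal of 25) are hypothetical, not taken from the experiments.

```python
def relative_efficiency_from_speedup(s, s_goal):
    """Relative efficiency when a speedup goal is given: S(p, n) / S_G(p, n)."""
    return s / s_goal


def relative_efficiency_from_time(t, t_goal):
    """Relative efficiency when a time goal is given: T_G(p, n) / T(p, n)."""
    return t_goal / t


def time_goal_from_speedup_goal(t1, s_goal):
    """Time goal consistent with a speedup goal: T_G(p, n) = T(1, n) / S_G(p, n)."""
    return t1 / s_goal


if __name__ == "__main__":
    # Hypothetical numbers, chosen only to illustrate the definitions.
    t1, t = 1000.0, 50.0     # T(1, n) and T(p, n), in seconds
    s_goal = 25.0            # an assumed speedup goal S_G(p, n)
    s = t1 / t               # speedup S(p, n) = T(1, n) / T(p, n)
    t_goal = time_goal_from_speedup_goal(t1, s_goal)
    # Both definitions give the same relative efficiency when
    # T_G(p, n) * S_G(p, n) = T(1, n).
    print(relative_efficiency_from_speedup(s, s_goal))
    print(relative_efficiency_from_time(t, t_goal))
```

A value below 1 means the goal has not been reached; a value above 1 means the goal was not an upper bound after all.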

6. The concept of “ideal parallel speedup”

In the literature on scientific parallel computing and on domain decomposition methods for the numerical solution of partial differential equations, the notion of “ideal parallel speedup” is used when defining absolute efficiency. However, its definition lacks precision. When $S_A(p,n)$ is the ideal parallel speedup, the relation

$$S_A(p,n) \ge S(p,n) \tag{6.1}$$

holds whenever $S(p,n)$ is the acceleration obtained in a parallel computation. If we try to make this notion rigorous, we could say that $S_A(p,n)$ is the supremum, but what is never made clear is of what set $S_A(p,n)$ is the supremum. Even so, when $S_A(p,n)$ is the ideal parallel speedup, the absolute parallel efficiency is defined to be

$$E_A(p,n) \equiv \frac{S(p,n)}{S_A(p,n)} \tag{6.2}$$

The subscript $A$ above comes from “Absolute”. However, if we do not know for sure that Eq. (6.1) holds whenever $S(p,n)$ is the acceleration obtained in a parallel computation, this is a risky definition. Indeed, if Eq. (6.1) is assumed and yet there is an execution for which

$$S_A(p,n) < S(p,n) \tag{6.3}$$

then we would claim that such an $S(p,n)$ is not achievable and we would be satisfied with an acceleration that is close to $S_A(p,n)$, even if $S_A(p,n)$ is much smaller than $S(p,n)$.

7. The international DDM research goal

In the light of the outstanding results we are reporting, it seems that this is what has happened in the case of international DDM research. It is generally thought that Eq. (6.1) holds with $S_A(p,n) = p$; i.e.,

$$S(p,n) \le p \tag{7.1}$$

Hence the standard definition of efficiency of Eq. (4.1):

$$E_S(p,n) \equiv \frac{S(p,n)}{p} \tag{7.2}$$

Comparing this equation with Eq. (5.2), it is seen that Eq. (7.2) implies that the speedup goal sought by DDM research worldwide is

$$S_S(p,n) = p \tag{7.3}$$

Here, we have written $S_S(p,n)$ for the speedup goal of standard DDM research.

8. The relative DVS efficiency of standard approaches

In this Section we carry out a simple exercise in which we compute the relative efficiency of standard approaches when the goal speedup is that achieved by the DVS-BDDC algorithm in the numerical experiments reported here. The notation adopted for such a relative efficiency is $E_{SDVS}$. Applying the definition of Eq. (5.2), we get

$$E_{SDVS}(p,n) = \frac{S(p,n)}{S_G(p,n)} = \frac{p}{S_{DVS}(p,n)} \tag{8.1}$$

Inspecting the results of our numerical experiments reported in the last column of Table 1, in view of Eq. (7.1), it is seen that the efficiency of standard approaches relative to the performance of DVS-BDDC is only 9.7%, 6.7%, 3.5%, 1.7% and 1.4% of what is obtained with DVS-BDDC in these experiments, for 16, 25, 64, 256 and 400 processors, respectively. Hence, our conclusion in this Section is that the speedup goals sought by DDM research worldwide up to now are too small and should be revised.
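The ratios in the last column of Table 1 follow directly from the reported speedups. The short Python sketch below recomputes them as $p / S_{DVS}(p,n)$; the names `DVS_SPEEDUP` and `standard_goal_relative_to_dvs` are ours and are used only for illustration.

```python
# Speedups reported in Table 1 for the DVS-BDDC runs, indexed by p.
DVS_SPEEDUP = {16: 164.5, 25: 375.4, 64: 1829.0, 256: 14639.0, 400: 29278.0}


def standard_goal_relative_to_dvs(p, dvs_speedup=DVS_SPEEDUP):
    """Efficiency of the standard goal p relative to the DVS-BDDC speedup."""
    return p / dvs_speedup[p]


if __name__ == "__main__":
    for p in sorted(DVS_SPEEDUP):
        pct = 100 * standard_goal_relative_to_dvs(p)
        print(f"p={p:3d}  E_SDVS={pct:.1f}%")
```

The ratio decreases as $p$ grows, which is the same trend the last column of Table 1 displays.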

9. The Speedup Goal of the Divide and Conquer Framework

As a starting point for revising the speedup goal, we recall the divide and conquer algorithmic paradigm ([23], p. v), which is frequently considered the leitmotiv of domain decomposition methods [21]. The divide and conquer strategy (DC-strategy) consists in dividing the domain of definition of the scientific or engineering model into small pieces and then sending each one of them to a different processor. If $p$ is the number of subdomains of the domain decomposition, the size of each piece is approximately equal to $n/p$; hence, it is smaller than $n$ when $p > 1$, and much smaller than $n$ when $p$ is large. This is the procedure for reducing the size of the problems treated by the different processors when the DC-strategy is applied. Of course, for the divide and conquer strategy to be effective it is necessary and sufficient that each one of the local problems be independent of all others. Such a condition (each local problem being independent of all others) is seldom fulfilled in practice, and it will be referred to as the DC-paradigm. Adopting the DC-paradigm as a guide in the development of software implies striving to construct algorithms in which the local problems are as independent from each other as possible. We mention in passing that the derived-vector space framework, which has been so effective in the outstanding numerical experiments reported here, was developed following the DC-paradigm.

Since the approximate size of each local problem is $n/p$, when all of them are independent, $T(1,n/p)$ would be the approximate execution time for each one of them, which, when the computation is carried out in parallel, is also the global execution time. Therefore, in the DC-framework we define the execution-time goal (DC-execution-time goal), to be denoted by $T_{DC}(p,n)$, as:

$$T_{DC}(p,n) \equiv T(1,n/p) \tag{9.1}$$

Correspondingly, the speedup goal for the DC-approach is defined to be

$$S_{DC}(p,n) \equiv \frac{T(1,n)}{T(1,n/p)} \tag{9.2}$$

and the DC-efficiency is given by

$$E_{DC}(p,n) \equiv \frac{S(p,n)}{S_{DC}(p,n)} = \frac{T(1,n/p)}{T(p,n)} \tag{9.3}$$

In Table 3, to illustrate the Divide and Conquer concepts, they have been computed in the conditions of the numerical experiments that prompted this paper.

Table 3. DC-execution-time goal and the DC-speedup goal (columns: $p$; $n/p$; $T_{DC}(p,n)$; $S_{DC}(p,n)$; $p^2$; $\left(p^2 - S_{DC}(p,n)\right)/p^2$)

The first and second columns (counted from left to right) contain the number of processors and the degrees of freedom of the local problems, respectively. The third column yields the DC-execution-time goals of the local problems, which were obtained through numerical experiments; for each $p$ only one of the local problems was solved numerically (and only one of the processors was used). Once $T_{DC}(p,n)$ was known, $S_{DC}(p,n)$ was computed by applying Eq. (9.2). The local solvers used in our numerical experiments were banded LU decompositions, whose algorithmic complexity yields the estimate $p^2$, which is given in the fifth column. An interesting fact, in the numerical experiments reported here, is that this estimate approximates $S_{DC}(p,n)$; the last column of Table 3 gives the corresponding relative errors, in percentage, associated with such an approximation. For the purpose we have in mind, such errors are admissible.
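To make the DC-goals concrete, here is a hedged Python sketch. It assumes a hypothetical cost model $T(1,m) = c\,m^2$ for the sequential solver; this is our assumption for illustration only, since the paper obtains $T(1,n/p)$ experimentally. Under that model the DC-speedup goal equals $p^2$ exactly and the problem size cancels. The value 233.9 used for $p = 16$ is the DC-speedup goal reported in Table 5, and 164.5 is the speedup reported in Table 1; all function names are ours.

```python
def sequential_time(m, c=1.0, exponent=2.0):
    """Hypothetical cost model T(1, m) = c * m**exponent (an assumption)."""
    return c * m ** exponent


def dc_time_goal(p, n, **model):
    """DC-execution-time goal: T_DC(p, n) = T(1, n / p)."""
    return sequential_time(n / p, **model)


def dc_speedup_goal(p, n, **model):
    """DC-speedup goal: S_DC(p, n) = T(1, n) / T(1, n / p)."""
    return sequential_time(n, **model) / dc_time_goal(p, n, **model)


def dc_efficiency(s, s_dc):
    """DC-efficiency: E_DC(p, n) = S(p, n) / S_DC(p, n)."""
    return s / s_dc


if __name__ == "__main__":
    n = 1_000_000  # arbitrary size; it cancels under the assumed model
    p = 16
    print(dc_speedup_goal(p, n))         # p**2 under the quadratic model
    print((p ** 2 - 233.9) / p ** 2)     # gap between p**2 and the reported goal
    print(dc_efficiency(164.5, 233.9))   # DC-efficiency of the p = 16 run
```

Changing `exponent` shows how strongly the DC-speedup goal depends on the cost of the local solver: a linear-cost solver would give back the standard goal $p$.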

10. Incorporating the outstanding results in the DC-framework

In this Section the results of our numerical and computational experiments contained in Table 1 are incorporated in the DC-framework. Table 4, which was so built, follows.

| $p$ | $p^2$ | $T(p,n)$ | $S(p,n)$ | $T_{DC}(p,n)$ | $S_{DC}(p,n)$ | $E_{DC}(p,n) = T_{DC}(p,n)/T(p,n) = S(p,n)/S_{DC}(p,n)$ | $S(p,n)/p^2$ |
|---|---|---|---|---|---|---|---|
| 1 | 1 | 29,278 | 1 | 29,278 | 1 | 100% | 100% |
| 16 | 256 | 178 | 164.5 | [...] | [...] | [...] | [...] |
| [...] | | | | | | | |

Table 4.

The seventh [...] $S_{DC}(p,n)$ by $p^2$. By inspection of this table, it is seen that the outstanding results that prompted this paper look perfectly normal when they are displayed in the DC-framework. This shows that the DC-framework is adequate for accommodating the outstanding numerical and computational results that we have obtained using the DVS-BDDC algorithm.

11. Restrictions on parallel performance imposed by the standard framework

Assuming $S(p,n) \le S_S(p,n) = p$ is restrictive, and in this Section, together with the next one, we explore more thoroughly the restrictions on parallel performance that such an assumption imposes. To start with, the standard speedup goal, $p$, and the DC-speedup goal, $S_{DC}(p,n)$, corresponding to the set of experiments we have been discussing, are compared. Their ratios are shown in Table 5, where the values of $S_{DC}(p,n)$ are taken from Table 4.

| $p$ | $S_S(p,n)$ | $S_{DC}(p,n)$ | $S_{DC}(p,n)/S_S(p,n) = S_{DC}(p,n)/p$ | $S_S(p,n)/S_{DC}(p,n) = p/S_{DC}(p,n)$ |
|---|---|---|---|---|
| 1 | 1 | 1 | 1 | 1 |
| 16 | 16 | 233.9 | 14.6 | .0685 |
| 25 | 25 | [...] | [...] | [...] |
| [...] | | | | |

Table 5. [...] speedup goals

By inspection of Table 5, it is seen that the standard goal-speedups are much smaller than the goal-DC-speedups, and probably too conservative and restrictive. Table 6, which follows, shows the bounds of performance for any software that satisfies the restriction $S(p,n) \le p$.

Table 6. Restrictions of performance for standard software (columns: $p$; $T_S(p,n) = T(1,n)/p$; $S_S(p,n) = p$; $T_{DC}(p,n)$; $S_{DC}(p,n)$; $E_{SDC}(p,n) = p/S_{DC}(p,n)$)

The last column of this table shows that such an assumption severely limits the DC-efficiency that one can hope for when any of the standard methods is applied, including BDDC and FETI-DP [22].

12. Additional comparisons

To have a clearer appreciation of the relevance of the limitations imposed by the standard framework, which were established in Section 11, a direct comparison with the results obtained using the DVS-BDDC, which are given in Table 4, can help. Such a comparison is highlighted in Table 7.

| Method \ $p$ | 16 | 25 | 64 | 256 | 400 |
|---|---|---|---|---|---|
| DVS-BDDC DC-efficiencies | [...] | [...] | [...] | [...] | [...] |
| DC-efficiencies: bounds for standard | [...] | [...] | [...] | [...] | [...] |

Table 7. DC-efficiencies of standard and DVS-BDDC software

In summary, for all the numerical and computational experiments discussed here, the efficiency one can hope to obtain using standard software is only a small fraction of that obtained when the DVS-BDDC algorithm is applied. From all the above discussion, we draw the conclusion that adopting the definition $S_S(p,n) = p$, as is usually done in domain decomposition methods, is too conservative and drastically hinders the performance of methods developed within such a framework.

13. Conclusions

This paper communicates the outstanding results of numerical experiments in which the DVS-BDDC algorithm [2] yields speedups that exceed the number of processors by a large factor; 73.2 is the largest obtained in our experiments. From the analysis presented here the following conclusions are drawn:

1. The belief that the speedup (or acceleration) is always less than or equal to $p$ (the number of processors) is incorrect. Accelerations much larger than $p$ are not only feasible, but have been achieved using the DVS-BDDC algorithm;
2. The performance goal that research on DDM has pursued up to now, besides being too small, has been very restrictive for the software developed in that framework; and
3. The Divide and Conquer framework, introduced here, is more adequate for accommodating results such as those that have been obtained using the DVS-BDDC algorithm.

Based on these conclusions it is recommended that the Divide and Conquer framework be adopted in future research on the applications of parallel computation to the solution of partial differential equations. Then, the performance goal is defined in terms of the execution-time goal, as

$$T_{DC}(p,n) \equiv T(1,n/p) \tag{13.1}$$

or the speedup goal,

$$S_{DC}(p,n) \equiv \frac{T(1,n)}{T(1,n/p)} \tag{13.2}$$

or the divide and conquer efficiency:

$$E_{DC}(p,n) \equiv \frac{T(1,n/p)}{T(p,n)} = \frac{S(p,n)}{S_{DC}(p,n)} \tag{13.3}$$

Acknowledgment

We want to thank DGTIC for its support and computational resources assigned to this research in the cluster Miztli, which is the supercomputer of the National Autonomous University of Mexico (UNAM) under project LANDCAD-UNAM-DGTIC-065.

REFERENCES

1. Herrera I. & Contreras I. “Evidences that Software Based on Non-Overlapping Discretization is Most Efficient for Applying Highly Parallelized Supercomputers to Solving Partial Differential Equations”, Chapter 1 of the book “High Performance Computing and Applications”, J. Xie et al. (Eds.), Lecture Notes in Computer Science (LNCS), Springer-Verlag, pp. 1-16, 2016. DOI: 10.1007/978-3-319-32557-6_1
2. Herrera I., de la Cruz L.M. and Rosas-Medina A. “Non-Overlapping Discretization Methods for Partial Differential Equations”, Numer. Meth. Part D. E., 30: 1427-1454, 2014 (Open source).
3. Herrera I. & Rosas-Medina A. “The Derived-Vector Space Framework and Four General Purposes Massively Parallel DDM Algorithms”, EABE (Engineering Analysis with Boundary Elements), pp. 646-657, 2013.
4. Herrera I. & Contreras I. “An Innovative Tool for Effectively Applying Highly Parallelized Hardware to Problems of Elasticity”, Geofísica Internacional, 55 (1), pp. 39-53, 2015.
5. Herrera I. “Theory of Differential Equations in Discontinuous Piecewise-Defined-Functions”, Numer. Meth. Part D. E., (3), pp. 597-639, 2007.
6. Herrera I. and Yates R. “Unified Multipliers-Free Theory of Dual Primal Domain Decomposition Methods”, Numer. Meth. Part D. E., 25: 552-581, 2009.
7. Herrera I. & Yates R. A. “The Multipliers-free Domain Decomposition Methods”, Numer. Meth. Part D. E., 26, pp. 874-905, July 2010. DOI: 10.1002/num.20462
8. Herrera I. & Yates R. A. “The Multipliers-Free Dual Primal Domain Decomposition Methods for Nonsymmetric Matrices”, Numer. Meth. Part D. E., (5), pp. 1262-1289, 2011.
9. Toselli A. and Widlund O. “Domain Decomposition Methods - Algorithms and Theory”, Springer Series in Computational Mathematics, Springer-Verlag, Berlin, 2005, 450p.
Dohrmann C.R. “A preconditioner for substructuring based on constrained energy minimization”, SIAM J. Sci. Comput., 25(1): 246-258, 2003.
12. Mandel J. and Dohrmann C.R. “Convergence of a balancing domain decomposition by constraints and energy minimization”, Numer. Linear Algebra Appl., 10(7): 639-659, 2003.
13. Mandel J., Dohrmann C.R. and Tezaur R. “An algebraic theory for primal and dual substructuring methods by constraints”, Appl. Numer. Math., 54: 167-193, 2005.
14. Farhat C. and Roux F. “A method of finite element tearing and interconnecting and its parallel solution algorithm”, Internat. J. Numer. Methods Engrg., 32: 1205-1227, 1991.
15. Mandel J. and Tezaur R. “Convergence of a substructuring method with Lagrange multipliers”, Numer. Math., 73(4): 473-487, 1996.
16. Farhat C., Lesoinne M., LeTallec P., Pierson K. and Rixen D. “FETI-DP: a dual-primal unified FETI method, Part I: A faster alternative to the two-level FETI method”, Int. J. Numer. Methods Engrg., 50, pp. 1523-1544, 2001.
17. Farhat C., Lesoinne M. and Pierson K. “A scalable dual-primal domain decomposition method”, Numer. Linear Algebra Appl., 7, pp. 687-714, 2000.
18. Klawonn A., Lanser M. & Rheinbach O. “Toward extremely scalable nonlinear domain decomposition methods for elliptic partial differential equations”, SIAM J. Sci. Comput., Vol. 37, No. 6, pp. C667-C696, 2015.
19. Smith B., Bjørstad P. & Gropp W. “Domain Decomposition: Parallel Multilevel Methods for Elliptic Partial Differential Equations”, Cambridge University Press, 621p, 1996.
20. Toivanen J., Avery P. and Farhat C. “A multilevel FETI-DP method and its performance for problems with billions of degrees of freedom”, Int. J. Numer. Methods Eng., 2018; :661-682.
21. Scott L. R., Clark T. W. and Bagheri B. “Scientific Parallel Computing”, Princeton University Press, 2005.
22. Mathew T. “Domain Decomposition Methods for the Numerical Solution of Partial Differential Equations”, Lecture Notes in Computational Science and Engineering, Springer Publishing Company, Incorporated, 2008.
23.