Re-evaluating scaling methods for distributed parallel systems
János Végh
Abstract—The paper explains why Amdahl's Law shall be interpreted specifically for distributed parallel systems and why it generated so many debates, discussions, and abuses. We set up a general model and list many of the terms affecting parallel processing. We scrutinize the validity of neglecting certain terms in different approximations, with special emphasis on the famous scaling laws of parallel processing. We clarify that, when using the right interpretation of its terms, Amdahl's Law is the governing law of all kinds of parallel processing. Amdahl's Law describes, among others, the history of supercomputing and the inherent performance limitation of the different kinds of parallel processing, and it is the basic law of the 'modern computing' paradigm, which the computing systems working under extreme computing conditions desperately need.
I. INTRODUCTION

AMDAHL, in his famous paper [1], wanted to draw attention, even in the title, to the fact that (as he coined the wording) the Single Processor Approach (SPA) would seriously limit the achievable computing performance, given that "the organization of a single computer has reached its limits", and attempted to explain why it was so. Unfortunately, his successors nearly completely misunderstood his intention. Rather than developing "interconnection of a multiplicity of computers in such a manner as to permit cooperative solution", the followers used his idea only to derive the limitations of computing systems built from components manufactured for the SPA. The successors constructed his famous formula as well and, unfortunately, attributed an inadequate meaning to its terms. The quick technical development suppressed the real intention and meaning of the Law. When the computing needs and possibilities reached the point where the precise meaning of the Law matters, the incorrect interpretation attributed to its terms did not describe the experiences, giving way to different other 'laws' and scaling modes. With the proper interpretation, however, we show "that Amdahl's Law is one of the few fundamental laws of computing" [2]. Not only of computing, but of all (even computing-unrelated) partly parallelized, otherwise sequential activities. This paper discusses only the consequences of the idea on the scaling of computer systems constructed in the SPA; for the idea he was thinking about, see [3], [4], [5].

In section II, the considered scaling methods are shortly reviewed, and some of their consequences are discussed.
Project no. 125547 has been implemented with the support provided from the National Research, Development and Innovation Fund of Hungary, financed under the K funding scheme. Kalimános BT, 4032 Debrecen, Hungary. Manuscript submitted to IEEE Transactions February 16, 2020; revised ??, 2020.
Section II-A describes Amdahl's idea shortly and introduces his famous formula using our notations. In section II-B, we scrutinize the basic idea of massively parallel processing, Gustafson's idea. In section II-C we discuss the 'modern scaling law' [6], based on the recently introduced 'modern computing' [7]; essentially, Amdahl's idea applied to modern computing systems.

Section III introduces a (by intention strongly simplified and non-technical) model of parallelized sequential processing. The model visualizes the meaning of the 'parallelizable portion' and enables us to draw the region of validity of the 'strong' and 'weak' scaling methods. As those scaling methods and principles are relevant to all fields of parallel and distributed processing, we demonstrate the application of the presented formalism for different tasks in section IV.

II. THE SCALING METHODS
The scaling methods used in the field are necessarily approximations to the general model presented in section III. We discuss the nature and validity of those approximations, and furthermore introduce the notations and the formalism.
A. Amdahl’s Law
Amdahl's Law (also called 'strong scaling') is usually formulated with an equation such as

S^{-1} = (1 − α) + α/N    (1)

where N is the number of parallelized code fragments, α is the ratio of the parallelizable fraction to the total (so (1 − α) is the "serial percentage"), and S is a measurable speedup. That is, Amdahl's Law considers a fixed-size problem, and the α portion of the task is distributed to the fellow processors. When calculating the speedup, one calculates

S = ((1 − α) + α) / ((1 − α) + α/N) = N / (N · (1 − α) + α)    (2)

However, as expressed in [8]: "Even though Amdahl's Law is theoretically correct, the serial percentage is not practically obtainable." That is, concerning S there is no doubt that it corresponds to the ratio of the measured execution times, for the non-parallelized and the parallelized case, respectively. But what is the exact interpretation of α, and how can it be used?

Unfortunately, Amdahl used α with the meaning "the fraction of the number of instructions which permit parallelism" in the figure he used as an illustration in his paper. The illustration refers to the case when "around a point corresponding to 25% data management overhead and 10% of the problem operations forced to be sequential". At that "point", there is no place to discuss more subtle details of the performance-affecting factors (otherwise mentioned by Amdahl, such as "boundaries are likely to be irregular; interiors are inhomogeneous; computations required may be dependent on the states of the variables at each point; propagation rates of different physical effects may be quite different; the rate of convergence or convergence at all may be strongly dependent on sweeping through the array along different axes on succeeding passes"). It is worth noticing that Amdahl foresaw issues with "sparse" calculations (or, in general, the role of data transfer), and furthermore that the physical size of the computer and the interconnection of the computing units also matter. The latter is crucial in the case of distributed systems. However, at that time (unlike today [9], [10]) the execution time was strictly determined by the number of the executed instructions. What he wanted to say was "the fraction of the time spent with executing the instructions which permit parallelism" (at other places the correct expression "the fraction of the computational load" was used). This (unfortunately formulated) phrase "has caused nearly three decades of confusion in the parallel processing community. This confusion disappears when we use processing times in the formulations" [8]. On one side, the researchers guessed that Amdahl's Law was valid only for software (for the executed instructions), and on the other side the other affecting factors, which he mentioned but did not discuss in detail, were forgotten.

As expressed correctly in [8]:
"For example, if the serial percentage is to be derived from computational experiments, i.e., recording the total parallel elapsed time and the parallel-only elapsed time, then it can contain all overheads, such as communication, synchronization, input/output and memory access. The law offers no help to separate these factors. On the other hand, if we obtain the serial percentage by counting the number of total serial and parallel instructions in a program, then all other overheads are excluded. However, in this case, the predicted speedup may never agree with the experiments."

Really, one can express α from Eq. (1) in terms measurable experimentally as

α = (N / (N − 1)) · ((S − 1) / S)    (3)

That is, this α_eff value, the effective parallel portion, can be derived from the experimental data for the individual cases. Also, it is useful to express the efficiency with the pseudo-experimentally measurable data

E(N, α) = S/N = 1 / (N · (1 − α) + α) = R_Max / R_Peak    (4)

because for many parallelized sequential systems (including the TOP500 supercomputers) the efficiency (as R_Max/R_Peak) and the number of processors N are provided. Reversing the relation, the value of α_eff can be calculated as

α(E, N) = (E · N − 1) / (E · (N − 1))    (5)

As seen, the efficiency is a two-parameter function (the corresponding surface is shown in Fig. 1), demonstratively underpinning that "This decay in performance is not a fault of the architecture, but is dictated by the limited parallelism" [11] and that the properly interpreted Amdahl's Law perfectly describes its dependence on its variables. This proper interpretation also means that Amdahl's Law (after the pinpointing given in section II-C) shall describe the behavior of systems using a variety of parallelism, see section IV.
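A minimal numerical sketch (in Python, with illustrative values chosen by us rather than taken from any measurement) of how Eqs. (3)-(5) are used in practice, deriving α_eff from a measured speedup and cross-checking it against the efficiency reported as R_Max/R_Peak:

# Hypothetical illustration of Eqs. (3)-(5); the numbers are made up, not measured.
def alpha_from_speedup(S, N):
    """Eq. (3): effective parallel portion from the measured speedup S on N processors."""
    return (N / (N - 1)) * ((S - 1) / S)

def efficiency(N, alpha):
    """Eq. (4): E = S/N = 1 / (N*(1-alpha) + alpha) = R_Max / R_Peak."""
    return 1.0 / (N * (1 - alpha) + alpha)

def alpha_from_efficiency(E, N):
    """Eq. (5): alpha_eff recovered from the published efficiency and core count."""
    return (E * N - 1) / (E * (N - 1))

N = 10_000                               # number of processing units (assumed)
S = 3_000                                # measured speedup (assumed)
alpha_eff = alpha_from_speedup(S, N)     # about 0.9998
E = efficiency(N, alpha_eff)             # reproduces S/N = 0.3
print(alpha_eff, E, alpha_from_efficiency(E, N))   # the last value returns alpha_eff again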
B. Gustafson’s Law
Partly because of the outstanding achievements of the parallelization technology, and partly because of the issues around the practical utilization of Amdahl's Law, the 'weak scaling' (also called Gustafson's Law [12]) was also introduced, meaning that the computing resources grow proportionally with the task size. Its formulation was (using our notations)

S = (1 − α) + α · N    (6)

Similarly to Amdahl's Law, the efficiency can be derived for Gustafson's Law as (compare to Eq. (4))

E(N, α) = S/N = α + (1 − α)/N    (7)

From these equations it immediately follows that the speedup (the parallelization gain) increases linearly with the number of processors, without limitation. This conclusion was launched amid much fanfare. They imply, however, some more immediate conclusions, such as

• some speedup can be measured even if no processor is present
• the efficiency slightly increases with the number of processors N (the more processors, the better efficacy)
• the non-parallelizable portion of the job either shrinks as the number of processors grows, or, despite being non-parallelizable, is distributed by the system between the N processors
• executing the extra instructions needed by the first processor to organize the joint work needs no time
• all non-payload computing contributions, such as communication (including network transfer), synchronization, input/output and memory access, take no time

However, an error was made in deriving Eq. (6): the N − 1 processors are idle waiting while the first one is executing the sequential-only portion. Because of this, the time that serves as the base for calculating the speedup in the case of using N processors is

T_N = (1 − α)_processing + α · N + (1 − α) · (N − 1)_idle = (1 − α) · N + α · N = N

For the meaning of the terms in [12], the author used the wording "is the amount of time spent (by a serial processor)".
That is, before fixing the arithmetic error, impossible conclusions follow; after fixing it, the conceptual error comes to light: the weak scaling assumes that the single-processor efficiency can be transferred to the parallelized sequential subsystems without loss, i.e., that the efficacy of a system comprising N single-thread processors remains the same as that of a single-thread processor. This statement strongly contradicts the experienced 'efficiency' of the parallelized systems, not speaking about the 'different efficiencies' [6]; see also Fig. 1.

That is, Gustafson's Law is simply a misinterpretation of the argument α: a simple functional form transforms Gustafson's Law into Amdahl's Law [8]. After making that transformation, the two (apparently very different) laws become identical. However, as suspected by [8]: "Gustafson's formulation gives an illusion that as if N can increase indefinitely". This illusion led to the moon-shot of targeting to build supercomputers with computing performance well above the feasible (and reasonable) size, and may lead to false conclusions in the case of using clouds. The 'real scaling' also explains why one could not reveal this illusion for decades and why it provoked decades-long debates in the community.
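A small numerical sketch (our own illustration, not taken from [8]) of the functional transformation mentioned above: if α_G denotes the parallel fraction measured in the scaled (parallel) run, the substitution α = α_G·N / ((1 − α_G) + α_G·N) makes Amdahl's formula (2) reproduce Gustafson's speedup (6) exactly:

# Sketch: Gustafson's Law is Amdahl's Law with a re-interpreted alpha.
# alpha_g is the parallel fraction seen in the scaled run; the values are illustrative.
import math

def amdahl_speedup(alpha, N):
    return N / (N * (1 - alpha) + alpha)          # Eq. (2)

def gustafson_speedup(alpha_g, N):
    return (1 - alpha_g) + alpha_g * N            # Eq. (6)

def alpha_fixed_size(alpha_g, N):
    # parallel fraction of the fixed-size (serial-run) workload, from the scaled-run fraction
    return alpha_g * N / ((1 - alpha_g) + alpha_g * N)

alpha_g = 0.99                                    # assumed scaled-run parallel fraction
for N in (10, 1_000, 100_000):
    a = alpha_fixed_size(alpha_g, N)
    assert math.isclose(amdahl_speedup(a, N), gustafson_speedup(alpha_g, N), rel_tol=1e-9)
print("the two formulations coincide under this change of variable")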
C. Real scaling
The role of α was theoretically established [13], and the phenomenon itself, that the efficiency (in contrast with Eq. (7)) decreases as the number of the processing units increases, has been known for decades [11] (although it was not formulated in the functional form given by Eq. (4)). In the past decades, the theory faded, mainly due to the quick development of the parallelization technology and of the single-processor performance. The community used the 'weak scaling' approximation to calculate the expected performance values, in many cases outside its range of validity. The 'gold rush' for building exascale computers finally made it obvious that under the extreme conditions represented by the need for millions of processors, the 'weak scaling' used leads to false conclusions: it "can be seen in our current situation where the historical ten-year cadence between the attainment of megaflops, teraflops, and petaflops has not been the case for exaflops" [14]. It looks like, however, that in the feasibility studies of supercomputing using parallelized sequential systems, an analysis of whether building computers of such size is feasible (and reasonable) remained out of sight, either in the USA [15], [16] or in the EU [17] or in Japan [18] or in China [19].

Figure 1 depicts the two-parameter efficiency surface stemming from Amdahl's Law (see Eq. (4)). On the surface, some measured efficiencies of the present top supercomputers are also depicted, just to illustrate some general rules. To validate the model described in section III, the data of the rigorously verified supercomputer database [20] was used, as described in [6]. The High Performance Linpack (HPL) efficiencies sit on the surface, while the corresponding High Performance Conjugate Gradients (HPCG) values are much below those values. The conclusion drawn here was that "the supercomputers have two different efficiencies" [21], because one cannot explain the experience in the frame of the "classic computing paradigm".

Fig. 1. The two-parameter efficiency surface (in function of the parallelization efficiency (1 − α_eff^HPL) measured by the HPL benchmark and the number N of processing elements), as concluded from Amdahl's Law (see Eq. (4)), in the first-order approximation. Sample efficiency values for some selected supercomputers (Summit, Sierra, Taihulight, Tianhe-2, K computer; the TOP5 as of 2018.11) are shown, measured with the HPL and HPCG benchmarks, respectively.
The Taihulight and the K computer stand out from the "millions core" middle group. Thanks to its 0.3M cores, the K computer has the best efficiency for the HPCG benchmark, while Taihulight, with its 10M cores, has the worst one. The middle group follows the rules [6]. For the HPL benchmark: the more cores, the lower the efficiency. For the HPCG benchmark: the "roofline" [22] of that communication intensity was already reached, so all computers have about the same efficiency.

According to Eq. (4), the efficiency can be interpreted in terms of α and N, and the payload performance of a parallelized sequential computing system can be calculated as

P(N, α) = N · P_single / (N · (1 − α) + α)    (8)

This simple formula explains why the payload performance is not a linear function of the nominal performance and why, in the case of very good parallelization ((1 − α) ≪ 1) and low N, one cannot notice the nonlinearity. The functional form of the dependence reveals a surprising analogy, shown in detail in Table I and Fig. 2.

The right side of Fig. 2 reveals why the nonlinearity of the dependence of the payload performance on the nominal performance was not noticeable earlier: in the age of K (thousands of) processors, the effect was thousands of times smaller than in the age of M (millions of) processors, and the increase seemed to be linear. But anyhow: although one can use the 'weak scaling' up to around a few PFlops (and low communication intensity), it is surely not valid any more. How much the nonlinearity manifests depends on the type of the workload of the computing system [6].

Fig. 2. The analogy between how the speed of light changes the linearity of the speed dependence at extremely large speed values, and how the value of the performance changes the linearity of the performance dependence at extremely large performance values. (Left: relativistic speed of a body accelerated by 'g', for different optical densities n; right: payload performance of N cores at 100 GFlops each, for different values of (1 − α).)
TABLE I
THE ANALOGY OF ADDING SPEEDS IN PHYSICS AND ADDING PERFORMANCES IN COMPUTING, IN THE CLASSIC AND MODERN PARADIGM, RESPECTIVELY. IN BOTH CASES A CORRECTION TERM IS INTRODUCED THAT PROVIDES A NOTICEABLE EFFECT ONLY AT EXTREMELY LARGE VALUES.

  Physics (adding of speeds)                        Computing (adding of performances)
  Classic:  v(t) = t · a                            Classic:  Perf_total(N) = N · P_single
  Modern (relativistic):                            Modern (Amdahl-aware) [7], see Eq. (4):
    v(t) = t · a / sqrt(1 + (t · a / (c/n))^2)        P(N) = N · P_single / (N · (1 − α) + α)
  t = time, a = acceleration,                       N = number of cores, P_single = single-processor
  n = optical density, c = speed of light             performance, α = parallelism

That is, Eq. (8) defines the 'real scaling'. The linear approximation (according to the 'weak scaling') is not valid any more, although it was a good approximation at lower performance values and for shorter extrapolation distances.

Notice that in this section we assumed that α does not depend on N. This assumption is undoubtedly valid for a low number of processors, and unquestionably not valid for the cutting-edge supercomputers. That is, as discussed below, the bad news is that the increase of the payload performance is not linear in function of the nominal performance (as would be expected based on 'weak scaling'), but has a performance limit at which it saturates (according to the first-order approximation) or starts to decrease (according to the second-order approximation). The parallelized sequential processing has different rules of the game [11], [7]: the performance gain ("the speedup") has its inherent bounds [6].
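A short numerical sketch (with illustrative values only) of the 'real scaling' of Eq. (8): the payload performance saturates at P_single/(1 − α) as N grows, which is why the linear 'weak scaling' extrapolation eventually fails:

# Sketch of Eq. (8); P_single and alpha are assumed, not measured, values.
P_single = 100e9          # 100 GFlops per core (assumed)
alpha = 1 - 1e-7          # effective parallel portion (assumed)

def payload_performance(N, alpha, P_single):
    return N * P_single / (N * (1 - alpha) + alpha)      # Eq. (8)

for N in (1e3, 1e5, 1e7, 1e9):
    print(f"N={N:.0e}  P={payload_performance(N, alpha, P_single) / 1e18:.4f} EFlops")
# In the limit N -> infinity, P -> P_single / (1 - alpha), here 1 EFlops:
# adding further processors cannot push the payload performance above this bound.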
III. A NON-TECHNICAL MODEL OF PARALLELIZED SEQUENTIAL OPERATION

To understand why the different 'scaling' methods are approximations with a limited range of validity, we set up a simple non-technical model.

The speedup measurements are simple time measurements (although they need careful handling and proper interpretation; see good textbooks such as [23]): a benchmark executes a standardized set of machine instructions (a large number of times) and divides the known number of operations by the measurement time (sometimes secondary merits, such as GFlops/Watt or GFlops/USD, are also derived); this is done both for the single-processor and for the distributed parallelized sequential systems. In the latter case, however, the joint work must also be organized, implemented with additional machine instructions and additional execution time, forming an overhead. This additional activity is the origin of the inherent inefficiency of the parallelized sequential systems: one of the processors orchestrates the joint operation, while the others are idle waiting. At this point, the "dark performance" appears: the processing units are ready to operate and consume power, but do not do any payload work.

A closer analysis reveals that one of the essential prerequisites of applying Amdahl's Law is not strictly fulfilled in practice: "It requires the serial algorithm to retain its structure such that the same number of instructions are processed by both the serial and the parallel implementations for the same input" [8]. Because of this, Amdahl's Law itself is an approximation. In its original form, we can call it the first-order approximation to Amdahl's Law.
Fig. 3. A non-technical, simplified model of parallelized sequential computing operations. The contribution of a model component XXX to α (sometimes written as α_eff to emphasize that it is an effective, empirical value) will be denoted by α_eff^XXX in the text. Notice the different nature of those contributions; they have only one common feature: they all consume time. The vertical scale displays the actual activity for the processing units shown on the horizontal scale. (The figure distinguishes the Access, Initiation, Software Pre, OS Pre, Process (with Propagation Delays, PD), Just waiting, OS Post, Software Post, Access and Termination phases, together with the Payload, Total and Extended durations, and is annotated with example parameters of a Sunway/Taihulight-like configuration.)
The approximation assumes that, compared to the non-parallelizable payload work, organizing the joint work is negligible. The validity of this assumption is limited to a very low number of cores and a relatively high ratio of the non-payload workload. Recall that in the age of Amdahl, the non-payload workload ratio was in the range of dozens of percent, so some extra work did not make a considerable difference. Today, as we discuss below, the ratio of the overhead is orders of magnitude lower, while the number of cores is orders of magnitude higher (see also the parameters of the different configurations in Fig. 3). This latter aspect we consider as the second-order approximation to Amdahl's Law; for more details see [6].

Amdahl's idea is to put everything that we cannot parallelize, i.e., distribute between the fellow processing units, into the sequential-only fraction. In the spirit of this idea, for describing the parallel operation of sequentially working units, we prepared the model depicted in Figure 3. The technical implementations of the different parallelization methods show a virtually infinite variety [24], so we present here a (by intention) strongly simplified model. The non-parallelizable contributions are virtually classified (sometimes contracted) and shown as general contribution terms in the figure. In this way, the model is general enough to discuss some case studies of parallelly working systems qualitatively, neglecting the different contributions where possible. One can convert the model to a technical (quantitative) one by interpreting the contributions in technical terms, although with some obvious limitations.

As Figure 3 shows, in the parallel operating mode (in addition to the calculation, and furthermore the communication of data between the processing units) both the software (in this sense: computation and communication, including data access) and the hardware (interconnection, accelerator latency) contribute to the execution time, i.e., one must consider both of those components in Amdahl's Law. This statement is not new; again, see [1], [8].

The non-parallelizable (i.e. apparently sequential) part comprises contributions from Hardware (HW), Operating System (OS), Software (SW) and Propagation Delay (PD) (the "propagation rates of different physical effects"), and also some access time is needed for reaching the parallelized system. This separation is rather conceptual than strict, although dedicated measurements can reveal their role, at least approximately. Some features can be implemented in either SW or HW, or shared between them, and also some sequential activities may
happen partly in parallel with each other. The relative weights of the contributions are very different for different parallelized systems, and even within those cases they depend on many specific factors, so every single parallelization case requires a careful analysis. The SW activity represents what was assumed by Amdahl as the total sequential fraction. What did not yet exist in the age of Amdahl, the non-determinism of the modern HW systems [9], [10], also contributes to the non-parallelizable portion of the task: the slowest unit defines the resulting execution time of the parallelly working processing elements. Also, notice that optimization possibilities are present in the system; for example, see in Fig. 3 how the contributions of the classes propagation delay and looping delay can be combined to achieve a better timing.

Our model assumes no interaction between the processes running on the parallelized systems beyond the necessary minimum: starting and terminating the otherwise independent processes, which take parameters at the beginning and return a result at the end. It can, however, be trivially extended to the more general case when processes must share some resources (like a database, which shall provide different records for the different processes), either implicitly or explicitly. The concurrent objects have their inherent sequentiality [25], and the synchronization and communication between those objects considerably increase [26] the non-parallelizable portion (i.e., the contribution to (1 − α_eff^SW) or (1 − α_eff^OS)), so in the case of a massive number of processors, special attention must be devoted to their role in the efficiency of the application on the parallelized system.

In the case of distributed systems, the physical size of the computing system also matters: a processor, connected to the first one with a cable of a length of dozens of meters, must spend several hundred clock cycles waiting, only because of the finite speed of propagation of light, topped by the latency time and the hops of the interconnection (not mentioning geographically distributed computer systems, such as some clouds, connected through general-purpose networks). This aspect is completely neglected in the 'weak scaling' approximation.
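A back-of-the-envelope sketch (with assumed, illustrative numbers) of the propagation-delay contribution just mentioned:

# Assumed values: a 30 m cable, signal propagation at roughly 2/3 of the speed of light,
# and a 3 GHz clock; none of these figures comes from a concrete system.
cable_length_m = 30.0
signal_speed_m_per_s = 2.0e8        # roughly 2/3 of c in a typical cable or fibre
clock_hz = 3.0e9

one_way_delay_s = cable_length_m / signal_speed_m_per_s       # 150 ns
wasted_cycles_round_trip = 2 * one_way_delay_s * clock_hz     # about 900 clock cycles

print(f"{one_way_delay_s * 1e9:.0f} ns one way, ~{wasted_cycles_round_trip:.0f} cycles per round trip")
# Interconnect latency and hops come on top of this purely physical lower bound.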
Detailed calculations are given in [27].

After reaching a certain number of processors, there is no more increase in the payload fraction when adding more processors: the first fellow processor has already finished the task and is idle waiting, while the last one is still idle waiting for the start command. One can increase this limiting number by organizing the processors into clusters; then, the first computer must speak directly only to the head of the cluster. Another way is to distribute the job near to the processing units, either inside the processor [28] or using processors to let the job be done by the processing units of a GPGPU.

This looping contribution is not considerable (and so not noticeable) at a low number of processing units, but can be a dominating factor at a high number of processing units. This "high number" was a few dozen at the time of writing the paper [11]; today it may be in the order of a few million. Considering the effect of the looping contribution is the borderline between the first- and second-order approximations in modeling the performance: the housekeeping keeps growing with the growing number of processors, while the resulting performance does not increase anymore. Eventually, the housekeeping gradually becomes the dominating factor of the performance limitation and leads to a decrease in the payload performance: "there comes a point when using more processors ... increases the execution time rather than reducing it" [11]. That is, the first-order approximation results only in saturated performance; the second-order approximation leads to reaching an inflection point followed by decreasing performance and efficiency.
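A minimal sketch of the difference between the two approximations, assuming (purely for illustration; the real functional form is discussed in [6]) that the looping/housekeeping contribution grows linearly with the number of processing units:

# First-order: (1 - alpha) is constant; second-order: an assumed housekeeping term grows
# with N. The linear growth c*N below is an illustrative assumption, not the model of [6].
P_single = 100e9           # assumed single-processor performance (Flops)
base = 1e-7                # assumed inherent non-parallelizable fraction
c = 1e-12                  # assumed housekeeping (looping) growth per processing unit

def payload_first_order(N):
    return N * P_single / (N * base + (1 - base))            # Eq. (8), constant (1 - alpha)

def payload_second_order(N):
    seq = base + c * N                                        # (1 - alpha) now grows with N
    return N * P_single / (N * seq + (1 - seq))

for N in (1e5, 1e6, 1e7, 1e8):
    print(f"N={N:.0e}  1st={payload_first_order(N) / 1e18:.4f} EFlops"
          f"  2nd={payload_second_order(N) / 1e18:.4f} EFlops")
# The first-order curve saturates; the second-order curve peaks (near N = 1e6 with these
# assumed numbers) and then declines.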
IV. APPLICATION FIELDS
According to the model, one expects (1 − α_eff) to describe the fraction of the total (even unintended or only apparent) sequential part in any HW/SW system, and it is a sensitive measure of disturbances and inefficiencies of parallelization [27]. This value can be used as a merit to compare setups, computers manufactured in different ages with different technologies, conditions of network operation, the algorithm for communication within a closed chip, or SW load balancing.

In this section (except section IV-E) we assume that the parallelized computing system is accessible in a negligible time, and that the parallelized system under study is properly defined. We do not care whether the one-time contributions (such as initiating the data structures and starting the calculations) are done by the user SW or by the OS; furthermore, we assume that we repeat the payload calculation so many times that we can neglect the one-time contributions.

A. Load balancing compiler
Today, mainly because of the more and more widespread utilization of multi-core processors, more and more applications are considered for re-implementation in a multi-core aware form. Because it is a serious (and expensive!) effort, before deciding to start such a re-implementation, one needs to guess the speed gain. After finishing the re-implementation, it would be desirable to measure whether we achieved the goal. A method to find out during development whether we can increase parallelization further using a reasonable amount of development work would be highly desirable. The achievable speed gain depends on both the structure of the code and the hardware architecture, so one must scrutinize all those aspects.

A compiler making load balancing of an originally sequential code for different numbers of cores is described and validated in paper [29], by running the executable code on platforms having different numbers of cores. In terms of efficiency, the results they presented have common features and can be discussed together.

The left subfigure of Fig. 4 (Fig. 8 in [29]) displays their results in function of the number of cores, using the figure of merit the authors used, the efficiency E (see Eq. (4)). The data displayed in the figures are derived simply through reading back diagram values from the mentioned figures in [29], so they may not be accurate. However, they are accurate enough to support our conclusions.
Fig. 4. Relative speedup (left side) and (1 − α_eff) (right side) values, measured running the audio and radar processing on different numbers of cores [29].

Their first example shows the results of implementing parallelized processing of an audio stream manually, with an initial (first attempt) and a more careful (by already experienced programmers) implementation. For the two different processings of audio streams, using the efficiency E as merit enables only a qualitative statement about load balancing, namely that "The higher number of parallel processes in Audio-2 gives better results", because the Audio-2 diagram decreases less steeply than Audio-1. In the first implementation, where the programmer had no previous experience with parallelization, the efficiency quickly drops with the increasing number of cores. In the second round, with experience from the first implementation, the loss is much less, so 1 − E rises less steeply.

Their second example is processing radar signals. Without switching on the load balancing optimization in their compiler, the slope of the 1 − E diagram line is much more significant. It seems to be unavoidable that as the number of cores increases, the efficiency (according to Eq. (4)) decreases, even at such a low number of cores. Both examples leave the questions open whether further improvements are possible or whether the parallelization is uniform in function of the number of cores.

In the right subfigure of Fig. 4 (Fig. 10 in [29]) the diagrams show the (1 − α_eff) values, derived from the same data. In contrast with the left side, these values are nearly constant (at least within the measurement data readback error), which means that the derived merit value is characteristic of the system. By recalling Eq. (1), one can identify this parameter as the resulting non-parallelizable part of the activity, which, even with careful balancing, one cannot distribute between the cores and cannot reduce.

In the light of this, one can conclude that both the programmer in the case of the audio stream and the compiler in the case of the radar signals correctly identified and reduced the amount of non-parallelizable activity. The α_eff is practically constant in function of the number of cores: nearly all optimization possibilities were found, and they hit the wall due to the unavoidable contribution of the non-parallelizable software contributions. The better parallelization leads to lower (1 − α_eff) values and less scatter in function of the number of cores. The uniformity of the values also makes it highly probable that in the case of the audio streams further optimization can be done, at least for 6-core and 8-core systems, while the processing of radar signals reached its bounds.

Note that we must not compare the absolute values for analyzing different programs: they represent the sequential-only part of two programs, which may be different. It looks like the (1 − α_eff) imperfectness can only be reduced to a certain level with software methods of parallelization.
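The same kind of check can be reproduced in a few lines of code. The sketch below uses synthetic numbers (not the values read back from [29]) merely to show why the nearly constant (1 − α_eff), rather than the decreasing E, is the characteristic figure of merit:

# Synthetic data: a program with a fixed non-parallelizable fraction of 2% (assumed).
one_minus_alpha = 0.02

def efficiency(N, one_minus_alpha):
    return 1.0 / (N * one_minus_alpha + (1 - one_minus_alpha))   # Eq. (4)

def one_minus_alpha_eff(E, N):
    return 1.0 - (E * N - 1) / (E * (N - 1))                     # from Eq. (5)

for N in (2, 4, 6, 8):
    E = efficiency(N, one_minus_alpha)     # what gets plotted in [29]: it decreases with N
    back = one_minus_alpha_eff(E, N)       # the derived merit: it stays at the 0.02 we put in
    print(N, round(E, 3), round(back, 4))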
B. Measuring the efficiency of the on-chip networking

It is not a trivial task to find out the subtle points of on-chip networking, because of both the limited accessibility and the low number of processing units. The merit developed here, however, can also help in that case, although the available non-dedicated measurements enable us to draw only conclusions of limited accuracy.

In [30] the authors compare the different communication strategies their Particle Swarm Optimization (PSO) uses when minimizing Rosenbrock's function and Rastrigin's function, respectively. From their data, we calculated the corresponding α_eff values and displayed them in Fig. 5. The fluctuations seen in the figure show the limitations of the (otherwise excellent) measurement precision; for this type of investigation, one would need a much longer measurement time.

The contribution of the OS cannot be separated, again, from the SW contribution. Although the precision of the available data does not enable a detailed analysis of the behavior of the scaling and an exhaustive qualification of the communication method, we can make some observations. When utilizing only two cores, the variety of ways to communicate is minimal. In this case, all communication methods delivered the same α_eff value, proving the self-consistency of both the model and the measurement. The values of α_eff deviate considerably for the two minimization methods, however. We attribute this deviation to the different structures (α_eff^SW) of the two applications. As the diagrams of the Rastrigin method show, the contribution due to the propagation delay is considerable for the Rastrigin method, but not for the Rosenbrock method.

Fig. 5. Comparing the efficiency, the efficiency slope and α_eff for different communication strategies (Ring, Neighbourhood and Broadcast methods) when running the two minimization tasks on the SoC of [30].

Fig. 6. The history of supercomputing in terms of the performance gain (in function of year and ranking).

The reason is that, for higher core numbers, α_eff is nearly constant in the case of the first two communication methods: one of the contributions dominates; for the Rosenbrock method α_eff^SW, while for the Rastrigin method α_eff^communication is the dominating term.

A bit different is the case of the broadcast-type communication, for both types of minimization methods: the resulting (1 − α_eff) increases with the increasing number of cores. Here the reason is that the number of collisions (and so the time spent with waiting for repetition) increases with the number of cores.
This contribution increasingly dominates for the Rastrigin case and moderately increases the already high (1 − α_eff) value at a high number of cores, while at a low number of cores the (1 − α_eff^SW) value persists in dominating for the Rosenbrock case.

C. The history of supercomputing
The TOP500 database [20] provides all the data needed to calculate α, independently from the date of manufacturing, technology, manufacturer, and the number and kind of processors. We can use the parallelization efficiency in studying (among others) the history of supercomputing.

During the past quarter of a century, the proportion of the contributions changed considerably: today the number of processors is thousands of times higher than it was a quarter of a century ago. The growing physical size and the higher processing speed increased the role of the propagation overhead; furthermore, the large number of processing units strongly amplified the role of the looping overhead. As a result of the technical development, the phenomenon of the performance limitation [11] returned, in a technically different form, at a much higher number of processors. As discussed, except for an extremely high number of processors, it can be assumed that α is independent of the number of processors. Eq. (5) can be used to derive quickly the value of α from the values of the parameters R_Max/R_Peak and the number of cores N.

D. The effect of technology change in supercomputing
As expressed by Eq. (8), the resulting performance of parallelized computing systems depends on both the single-processor performance and the performance gain (mainly the perfectness of the parallelization). To separate these two factors, Fig. 6 displays the performance gain of the supercomputers in function of their year of construction and their ranking in the given year. The "hillside" reflects the enormous development of the parallelization technology. Unfortunately, the different individual factors (such as interconnection quality, using accelerators and clustering, or using a slightly different computing paradigm) cannot be separated in this way, although even from this figure some limited-validity conclusions can be drawn.

One can localize two 'plateaus', before the year 2000 and after the year 2015, unfortunately underpinning Amdahl's Law and refuting Gustafson's Law. The values between 2000 and 2010 demonstrate the development of the interconnection technology (for a more detailed analysis, see Fig. 6 in [6]). Before 2010, running the benchmark on a top supercomputer was a communication-bound task; since 2015 it is a computing-bound task. The appearance of accelerators around 2011 caused some 'humps', and both the excellent clustering and the use of 'cooperating cores' [28] increased the achieved performance gain. The wide variety of technical components, manufacturers, interconnections, processors, and accelerators does not enable us to make more detailed connections.

Fortunately, both the validity of the 'real scaling' and the accuracy of the merit in predicting future performance values can be demonstrated. Supercomputers usually do not have a long lifespan and several documented stages. One of the rare exceptions is the supercomputer
Piz Daint. Its documented lifetime spans over six years, and during that time different numbers of cores, without and with acceleration, and using different accelerators, were used.

Figure 7 depicts the performance and efficiency values published during its lifetime, together with the diagram lines predicting (at the time of making the prediction) the values at higher nominal performance values. The left subfigure shows how the changes made in the configuration affected the efficiency (the timeline starts in the top right corner, and a line connects the consecutive stages).

In the right subfigure, the bubbles represent the data published in the adjacent editions of the TOP500 lists; the diagram lines crossing them are the predictions made from that snapshot. We can compare the predicted values to the value published in the next list. It is especially remarkable that introducing GPGPU acceleration resulted in only a slight increase (in good agreement with [32] and [31]) compared to the value expected based purely on the increase of the number of cores. Between the samplings more than one parameter was changed, that is, we cannot demonstrate the net effect of a single change. However, the measured data sufficiently underpin our limited-validity conclusions and show that the theory correctly describes the tendency of the development of the performance and the efficiency. The predicted performance values are also reasonably accurate.
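A sketch of how such a prediction can be made from a single published snapshot (the figures below are invented placeholders, not Piz Daint's actual TOP500 entries): derive α from Eq. (5), then evaluate Eq. (4) for the planned, larger configuration:

# Hypothetical snapshot of a machine (not real TOP500 data):
N_old, R_peak_old, R_max_old = 100_000, 5.0e15, 4.0e15   # cores, Flops, Flops
P_single = R_peak_old / N_old

E_old = R_max_old / R_peak_old                            # measured efficiency
alpha = (E_old * N_old - 1) / (E_old * (N_old - 1))       # Eq. (5)

# Planned upgrade: three times as many (identical) cores; alpha assumed unchanged.
N_new = 3 * N_old
E_new = 1.0 / (N_new * (1 - alpha) + alpha)               # Eq. (4)
R_max_predicted = E_new * N_new * P_single

print(f"alpha ~ {alpha:.6f}, predicted R_Max ~ {R_max_predicted / 1e15:.2f} PFlops")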
Fig. 7. The history of the supercomputer Piz Daint in terms of efficiency and payload performance. The left subfigure shows how the efficiency changed as the developers proceeded towards higher performance. The right subfigure shows the reported performance data (the bubbles), together with the diagram line calculated from that value as described above; compare the value of the diagram line to the measured performance data in the next reported stage. The documented stages range from Xeon E5-2670 (2012, 2013) through Xeon E5-2670 + NVIDIA K20x (2013) to Xeon E5-2690 + NVIDIA Tesla P100 (2016, 2017, 2018).

Introducing a GPU accelerator is a one-time performance increase step [31] and cannot be taken into account by the theory. Notice that introducing the accelerator increased the payload performance but decreased the efficiency (copying data from one address space to another increases latency). Changing the accelerator to another type with slightly higher performance (but higher latency due to the larger GPGPU memory) caused a slight decrease in the efficiency.
E. The effect of not considering the access time
In [33] the authors benchmarked some commercially available cloud services, fortunately using the HPL benchmark. Fig. 8 shows on the left side the efficiency (i.e., R_Max/R_Peak), and on the right side the (1 − α) values, in function of the number of processors in the used configuration. One can immediately notice, on one side, that the values of R_Max/R_Peak are considerably lower than unity, even for a very low number of cores. On the other side, the (1 − α) values steeply decrease as the number of cores increases, although the model contains only contributions which may only increase as the number of cores increases.

The benchmark HPL characterizes the setup, so the benchmark is chosen correctly. When acquiring measurement data, in the case of clouds, the access time must also be considered, see Fig. 3. If one measures the time on the client's computer (and this is what is possible using those services), one uses the time Extended in the calculation in place of
the time Total. That is, the 'device under test' is chosen improperly.

This artifact is responsible for both mentioned differences. The efficiency measured in this way would not achieve 100% even on a system comprising only one single processor. Since α measures the average utilization of the processors, this foreign contribution is divided by the number of processors, so with an increasing number of processors the relative weight of this foreign contribution decreases, causing the calculated value of (1 − α) to decrease. Since the access is provided through the Internet, where the operation is stochastic, the measurements cannot be as accurate as in purpose-built systems. (A long-term systematic study [34] found that the measured data show dozens of percent of variation in long-term runs, and also unexpected variation in short-term runs.) Some qualitative conclusions of limited validity, however, can be drawn even from those data.

At such a low number of processors, neither of the contributions depending on the processor number is considerable, so one can expect that in the case of a correct measurement (1 − α) would be constant. So, extrapolating the diagram lines of (1 − α) to the value corresponding to a one-processor system, one can see that both for the Edison supercomputer and for the Azure A series grid (and maybe also Rackspace) the expected value is approaching unity (but obviously below it). From the slope of the curve one can even estimate the value of (1 − α). Based on these data, one can agree with the conclusion that, on a good grid, the benchmark HPCG can run as effectively as on the supercomputer they used. One should note, however, that (1 − α) is about 3 orders of magnitude better for TOP500-class supercomputers, but this makes a difference only for HPL-class benchmarks and only at a large number of processors. This conclusion can be misleading: whether a high-performance cloud can replace a supercomputer in solving a task strongly depends on the number of cores, because of the different α values.

Note that in the case of the AWS grids and the Azure F series, the α_eff^(OS+SW) contribution starts at a considerably worse value, and this is reflected by the fact that their efficiency drops quickly as the number of cores increases. It is interesting to note that the ranking based on α is just the opposite of the ranking based on efficiency (and strongly correlates with the price of the service).
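A minimal sketch (all numbers assumed) of the artifact described above: an access time that does not belong to the device under test both depresses the measured efficiency, even for a single processor, and makes the derived (1 − α) fall roughly as 1/N instead of staying constant:

# Assumed, illustrative model: every run has the same payload wall-clock time, the
# inherent (1 - alpha) is 1e-4, and a fixed access time worth 20% of the payload time
# is wrongly included in the measurement (Extended used instead of Total).
true_one_minus_alpha = 1e-4
access_ratio = 0.20                       # T_access / T_payload (assumed)

def true_efficiency(N):
    return 1.0 / (N * true_one_minus_alpha + (1 - true_one_minus_alpha))   # Eq. (4)

def measured_efficiency(N):
    return true_efficiency(N) / (1.0 + access_ratio)    # payload flops over inflated time

def derived_one_minus_alpha(E, N):
    return (1.0 - E) / (E * (N - 1))                    # Eq. (5), rearranged for (1 - alpha)

for N in (2, 8, 32, 128):
    E = measured_efficiency(N)
    print(N, round(E, 3), f"{derived_one_minus_alpha(E, N):.2e}")
# The measured efficiency stays well below 1 even for tiny N, and the derived (1 - alpha)
# falls roughly as 1/N: exactly the two artifacts discussed above.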
Fig. 8. The effect of neglecting the access time when measuring the efficiency of some cloud services (AWS, AWS with HT, AWS with PG, Azure A, F and H series, Rackspace Compute1-60, SoftLayer) and of the Edison supercomputer.
One can also extrapolate the efficiency values to the point corresponding to one core only. In the case of a measurement with no such artifact, the backprojected efficiency value should be around unity. If the measurement artifact is present, it is not the case; the values deviate by a factor of up to 2. The backprojected (1 − α) values are much more consistent: they tend to hit the value of unity. Sometimes the value is lower (meaning some other, foreign, performance loss), but in no case higher than unity.

V. CONCLUSION
The scaling methods, mainly due to their simplicity, can be useful when applied in the range of their validity. Given that they are approximations, we must scrutinize the validity of their omissions periodically. The approximations to the performance of parallelized sequential systems routinely deployed the 'weak scaling' method to estimate the payload performance of future, ever larger-scale systems, without scrutinizing the validity of the method under the current technical conditions. However, using this approximation (the incremental development) led to unexpected phenomena, failed supercomputers, and unexpectedly low efficiency of the systems. The 'real scaling' is in complete agreement with the experiences and the measured values.

REFERENCES
[1] G. M. Amdahl, "Validity of the Single Processor Approach to Achieving Large-Scale Computing Capabilities," in AFIPS Conference Proceedings, vol. 30, 1967, pp. 483–485.
[2] J. M. Paul and B. H. Meyer, "Amdahl's Law Revisited for Single Chip Systems," International Journal of Parallel Programming, vol. 35, no. 2, pp. 101–123, Apr. 2007.
[3] J. Végh, "Introducing the Explicitly Many-Processor Approach," Parallel Computing, vol. 75, pp. 28–40, 2018.
[4] J. Végh, "EMPAthY86: A cycle accurate simulator for Explicitly Many-Processor Approach (EMPA) computer," Jul. 2016. [Online]. Available: https://github.com/jvegh/EMPAthY86
[5] J. Végh, "A configurable accelerator for manycores: the Explicitly Many-Processor Approach," ArXiv e-prints, Jul. 2016. [Online]. Available: http://adsabs.harvard.edu/abs/2016arXiv160701643V
[6] J. Végh, "Finally, how many efficiencies the supercomputers have?" The Journal of Supercomputing, 2020. [Online]. Available: 10.1007/s11227-020-03210-4
[7] J. Végh and A. Tisan, "The need for modern computing paradigm: Science applied to computing," in
    Performance Analysis of Systems and Software (ISPASS), 2013 IEEE International Symposium on, April 2013, pp. 215–224.
[10] P. Molnár and J. Végh, "Measuring Performance of Processor Instructions and Operating System Services in Soft Processor-Based Systems," 2017, pp. 381–387.
[11] J. P. Singh, J. L. Hennessy, and A. Gupta, "Scaling parallel programs for multiprocessors: Methodology and examples," Computer, vol. 26, no. 7, pp. 42–50, Jul. 1993.
[12] J. L. Gustafson, "Reevaluating Amdahl's Law," Commun. ACM, vol. 31, no. 5, pp. 532–533, May 1988.
[13] A. H. Karp and H. P. Flatt, "Measuring Parallel Processor Performance," Commun. ACM
    nitrdgroups/images/b/b4/NSA DOE HPC TechMeetingReport.pdf, December 2016.
[16] R. F. Service, "Design for U.S. exascale computer takes shape," Science
    Frontiers of Information Technology & Electronic Engineering
    -two-different-top-supercomputers, 2017.
[22] S. Williams, A. Waterman, and D. Patterson, "Roofline: An insightful visual performance model for multicore architectures," Commun. ACM, vol. 52, no. 4, pp. 65–76, Apr. 2009.
[23] D. Patterson and J. Hennessy, Eds., Computer Organization and Design. RISC-V Edition. Morgan Kaufmann, 2017.
[24] K. Hwang and N. Jotwani, Advanced Computer Architecture: Parallelism, Scalability, Programmability, 3rd ed. McGraw Hill, 2016.
[25] F. Ellen, D. Hendler, and N. Shavit, "On the Inherent Sequentiality of Concurrent Objects," SIAM J. Comput., vol. 43, no. 3, pp. 519–536, 2012.
[26] L. Yavits, A. Morad, and R. Ginosar, "The effect of communication and synchronization on Amdahl's law in multicore systems," Parallel Computing, vol. 40, no. 1, pp. 1–16, 2014.
[27] J. Végh and P. Molnár, "How to measure perfectness of parallelization in hardware/software systems," 2017, pp. 394–399.
[28] F. Zheng, H.-L. Li, H. Lv, F. Guo, X.-H. Xu, and X.-H. Xie, "Cooperative computing techniques for a deeply fused and heterogeneous many-core processor architecture," Journal of Computer Science and Technology, vol. 30, no. 1, pp. 145–162, Jan. 2015.
[29] W. Sheng, S. Schürmans, M. Odendahl, M. Bertsch, V. Volevach, R. Leupers, and G. Ascheid, "A compiler infrastructure for embedded heterogeneous MPSoCs," Parallel Computing, vol. 40, pp. 51–68, 2014.
[30] L. de Macedo Mourelle, N. Nedjah, and F. G. Pessanha, Reconfigurable and Adaptive Computing: Theory and Applications. CRC Press, 2016, ch. 5: Interprocess Communication via Crossbar for Shared Memory Systems-on-chip.
[31] H. Simon, "Why we need Exascale and why we won't get there by 2020," in Exascale Radioastronomy Meeting.
[32] V. W. Lee, C. Kim, J. Chhugani, M. Deisher, D. Kim, A. D. Nguyen, N. Satish, M. Smelyanskiy, S. Chennupaty, P. Hammarlund, R. Singhal, and P. Dubey, "Debunking the 100X GPU vs. CPU myth: An Evaluation of Throughput Computing on CPU and GPU," in Proceedings of the 37th Annual International Symposium on Computer Architecture, ser. ISCA '10. New York, NY, USA: ACM, 2010, pp. 451–460. [Online]. Available: http://doi.acm.org/10.1145/1815961.1816021
[33] M. Mohammadi and T. Bazhirov, "Comparative Benchmarking of Cloud Computing Vendors with High Performance Linpack," in