Noname manuscript No. (will be inserted by the editor)
Finally, how many efficiencies do supercomputers have?
And what do they measure?
János Végh
Received: date / Accepted: date
Abstract
Using an extremely large number of processing elements in computing systems leads to unexpected phenomena, such as different efficiencies of the same system for different tasks, that cannot be explained within the frame of the classical computing paradigm. The simple, non-technical model introduced here (which nevertheless considers the temporal behavior of the components) enables us to set up the frame and formalism needed to explain those unexpected experiences around supercomputing. Introducing temporal behavior into computer science also explains why only extreme-scale computing enabled us to reveal the experienced limitations. The paper shows that the degradation of the efficiency of parallelized sequential systems is a natural consequence of the classical computing paradigm, rather than an engineering imperfection. The workload that supercomputers run is largely responsible for wasting energy, as well as for limiting the size and type of tasks. Case studies provide insight into how different contributions compete for dominating the resulting payload performance of a computing system, and how enhancing the interconnection technology made computing+communication dominant in defining the efficiency of supercomputers. Our model also enables us to derive predictions about supercomputer performance limitations for the near future, and it provides hints for enhancing supercomputer components. Phenomena experienced in large-scale computing show interesting parallels with phenomena experienced in science more than a century ago, through whose study a modern science was developed.
Project no. 136496 has been implemented with the support provided from the National Research, Development and Innovation Fund of Hungary, financed under the K funding scheme. An extended and updated form of the MS submitted to J. Supercomputing.
J. Végh, Kalimános BT, Hungary. E-mail: [email protected]
Keywords
Supercomputer performance · Parallelized sequential processing · Efficiency of supercomputers · Limitations of parallel processing · Inherent limitations of distributed processing · Behavior of extreme-scale systems · ANN performance · ANN efficiency · Temporal behavior

1 Introduction

Given that the dynamic growth of single-processor performance stalled about two decades ago [1], the only way that remained to achieve the required high computing performance is parallelizing the work of a vast number of sequentially working single processors. However, as was predicted very early [2], and experimentally confirmed decades later [3], the scaling of parallelized computing is not linear. Even, "there comes a point when using more processors . . . actually increases the execution time rather than reducing it" [3]. Parallelized sequential processing has different rules of the game [3,4]: its performance gain ("speedup") has its inherent bounds [5].

Akin to how the laws of science limit the performance of single-thread processors [6], the commonly used computing paradigm (through its technical implementation) limits the payload performance of supercomputers [4]. On one side, experts expected performance to reach the magic 1 Eflops around the year 2020; see Figure 1 in [7]. "The performance increase of the No. 1 systems slowed down around 2013, and it was the same for the sum performance" [7], but the authors extrapolated linearly, expecting that the development continues and that "zettascale computing" (i.e., a thousand times the exaflops target) shall be achieved in just more than a decade. On the other side, it has recently been admitted that linearity is "A trend that can't go on ad infinitum". Furthermore, it "can be seen in our current situation where the historical ten-year cadence between the attainment of megaflops, teraflops, and petaflops has not been the case for exaflops" [8]. Officially, the TOP500 [41] evaluation states (as of 2020) that "the top of the list remains largely unchanged" and "the full list recorded the smallest number of new entries since the project began in 1993".

The expectations placed on supercomputers are excessive. For example, as the name of the company PEZY witnesses, a billion-fold increase in payload performance is expected. It looks like, in the feasibility studies on supercomputing using parallelized sequential systems, an analysis of whether building computers of such a size is feasible (and reasonable) remained out of sight, either in the USA [9,10], in the EU [11], in Japan [12], or in China [7]. The "gold rush" is going on, even in the most prestigious journals [13,10].

(There are some doubts about the definition of exaFLOPS: whether it means the nominal performance R_Peak or the payload performance R_Max, measured by which benchmark (or how it depends on the workload the computer runs), and, finally, using what operand length. Here the term is used as R_{HPL-Max}. To produce higher figures, several other benchmark results, not related to floating-point computation, have been published. A special issue: https://link.springer.com/journal/11714/19/10. https://en.wikipedia.org/wiki/PEZY_Computing: the name PEZY is an acronym derived from the Greek-derived metric prefixes Peta, Exa, Zetta, Yotta.)
In addition to the previously existing "two different efficiencies of supercomputers" [14], further efficiency/performance values appeared (of course, with higher numeric figures), and several more efficiencies can easily be derived. Although severe counter-arguments have also been published, mainly based on the power consumption of both single processors and large computing centers [15], the moon-shot of limitless parallelized processing performance is still pursued. The probable source of the idea is "weak scaling" [16,3]. However, it is based simply on a misinterpretation [17,18] of the terms in Amdahl's law [55]. In reality, Amdahl's Law (in its original spirit) is valid for all parallelized sequential activities, including computing-unrelated ones, and it is the governing law of distributed (including super-) computing.

Demonstrative failures of some systems (such as the supercomputers Gyoukou and Aurora'18, and the brain simulator SpiNNaker) are already known, and many more are expected to follow, such as Aurora'21 [22], the mystic Chinese supercomputers and the planned EU supercomputers. Fugaku, although it considerably enhanced the efficacy of computing, mainly due to the clever placement and use mode of its on-chip memory, also stalled at about 40% of its planned capacity [23] and could increase its payload performance only marginally in half a year.

Similar is the case with exascale applications, such as brain simulation. Exaggerated news about simulating the brain of some animals, or a large percentage of the human brain, appeared. The reality is that the many-thread implementation of a brain simulator can fill an extremely large amount of memory with the data of billions of artificial neurons [24], and a purpose-built (mainly hardware (HW)) brain simulator can be designed to simulate one billion neurons [25]. In practice, however, they both can simulate only about 80 thousand neurons [26], mainly because of "the quantal nature of the computing time" [27]. "More Is Different" [28].

The confusion is growing. The paper attempts to clear up the terms by scrutinizing the basic notions, contributions, and measurement methods. In section 2, an intentionally strongly simplified, non-technical model, based on the temporal behavior of the physical implementation of computing [19], is presented. The notations for Amdahl's Law, which form the basis of the present paper, are introduced in section 3.

(The related work and speedup deserved the Gordon Bell Prize. As explicitly suspected in [18]: Gustafson's formulation gives the illusion that N can increase indefinitely. It was also learned that a specific processor design is needed for exascale: as part of the announcement, the development line Knights Hill [20] was canceled, to be replaced by a "new platform and new micro-architecture specifically designed for exascale". Despite its failure, SpiNNaker2 is also under construction [21]. https://ec.europa.eu/newsroom/dae/document.cfm?doc_id=60156)
(Fig. 1 panel labels: Access, Initiation, Software Pre, OS Pre, Process, Just waiting, OS Post, Software Post, Access, Termination; the brackets mark the Payload, Total and Extended time spans.)
Fig. 1
Left: The time diagram of parallelized sequential operation in time-space [19]. Right: A conceptual, simplified model [55] of parallelized sequential computing operations, based on their temporal behavior. Notice the different nature of those contributions, and that they have only one common feature: they all consume time.

It is shown that the degradation of the efficiency of parallelized sequential systems, as was suspected early [3], is a natural consequence of the computing paradigm, rather than an engineering imperfection (in the sense that it could be fixed later). Furthermore, its consequence is that parallelized sequential computing systems, by their very nature, have an upper payload performance bound. Different contributions form the sequential portion of the task (and, through this, degrade its parallelized performance), as detailed in section 4. The established model is validated in section 5. Given that the race to produce computing systems having components and systems with higher performance numbers is going on, section 6 predicts the expectable results of developments in the near future. The section introduces some further performance merits and, through interpreting them, concludes that increasing the size of supercomputers further, and making expensive enhancements to their technologies, only increases their non-payload performance. Section 7 discusses that, under extreme conditions, technical objects of computing show a series of behaviors (for more details see [4]) similar to those of natural objects.

The performance measurements are simple time measurements (although they need careful handling and proper interpretation; see good textbooks such as [29]): a standardized set of machine instructions is executed (a large number of times) and the known number of relevant operations is divided by the measurement time; this holds for both single-processor and distributed, parallelized
However, themodel is general enough to discuss some examples of parallelly working sys-tems qualitatively. We shall neglect different contributions as possible in thedifferent cases. Our model can easily be converted to a technical (quantitative)one via interpreting its contributions in technical terms, although with someobvious limitations. Such technical interpretations also enable us to find outsome technical limiting factors of the performance of parallelized computing.The non-parallelizable (i.e. apparently sequential) part of tasks comprisescontributions from HW, operating system (OS), software (SW) and Prop-agation Delay (PD), and also some access time is needed for reaching theparallelized system. This separation is rather conceptual than strict, althoughdedicated measurements can reveal their role, at least approximately. Somefeatures can be implemented in either SW or HW, or shared between them.Furthermore, some sequential activities may happen partly parallel with eachother. Relative weights of these different contributions are very different fordifferent parallelized systems, and even within those cases depend on many spe-cific factors. That means, in every single parallelization case, a careful analysisis required. SW activity represents what was assumed by Amdahl as the totalsequential fraction . Non-determinism of modern HW systems [32] [33] alsocontributes to the non-parallelizable portion of the task: the resulting execu-tion time of parallelly working processing elements is defined by the slowestunit. This aspect is neglected in the weak scaling approximation Although some OS activity was surely included, Amdahl concluded some 20 % SW frac-tion, so at that time the other contributions could be neglectedapart from SW contribution.As shown in Figure 1 and discussed below, for today, this contribution became by severalorders of magnitude smaller. However, at the same time the number of the cores grew severalorders of magnitude larger. J´anos V´egh
Notice that our model assumes no interaction between the processes running on the parallelized system, beyond the necessary minimum: starting and terminating the otherwise independent processes, which take their parameters at the beginning and return their results at the end. It can, however, be trivially extended to more general cases, when the processes must share some resource (such as a database, which shall provide different records for the different processes), either implicitly or explicitly. Concurrent objects have their inherent sequentiality [34], and synchronization and communication between those objects considerably increase [35] the non-parallelizable portion of the task (i.e., the contribution to (1 − α_eff^SW) or (1 − α_eff^OS)). Because of this effect, in the case of an extremely large number of processors, special attention must be devoted to their role in the efficiency of applications running on parallelized systems.

The physical size of the computing system also matters. A processor, connected to the first one with a cable of a length of dozens of meters, must spend several hundred clock cycles waiting, simply because of the finite speed of the propagation of light, topped by the latency time and the hops of the interconnection (not to mention geographically distributed computer systems, such as some clouds, connected through general-purpose networks). Detailed calculations are given in [36].

After reaching a certain number of processors, there is no more increase in the payload fraction when adding more processors: the first fellow processor has already finished its task and is idle waiting, while the last one is still idle waiting for its start command. This limiting number can be increased by organizing the processors into clusters: the first computer must then speak directly only to the heads of the clusters. Another way is to distribute the job near to the processing units: it can happen either inside the processor [37,23], or one can let the processing units of a Graphic Processing Unit (GPU) do the job. (Notice, however, that any additional actor on the scene increases the latency of the computation.)

This looping contribution is not considerable (and, in this way, not noticeable) at a low number of processing units (apart from the other contributions), but it can be a dominating factor at a high number of processing units. This "high number" was a few dozen at the time of writing the paper [3]; today it is a few million. The method of how the effect of the looping contribution is considered is the borderline between the first- and second-order approximations in modeling the system's payload performance. The housekeeping keeps growing with the number of processors, while the system's resulting performance does not increase any more. The first-order approximation considers the contribution of the housekeeping as constant. The second-order approximation also considers that, as the number of processing units grows, the housekeeping grows with it, gradually becomes the dominating factor of the performance limitation, and leads to a decrease in the payload performance. (The strength of this effect strongly depends on the workload and the architecture.)
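As a minimal illustration of the difference between the two approximations, the sketch below (Python) compares a first-order model, where the looping/housekeeping term is a constant part of (1 − α), with a second-order model, where that term grows with the number of processing units. All parameter values are assumptions chosen only to show the qualitative shape, not measured data.

```python
# Illustrative sketch (not the author's code): first- vs second-order
# approximation of payload performance, following the discussion above.
# All parameter values are assumptions chosen only to show the shape of the curves.

def payload_performance(n_cores, one_minus_alpha, p_single=1.0):
    """Payload performance of N cores: N*P_single / (N*(1-alpha) + alpha)."""
    alpha = 1.0 - one_minus_alpha
    return n_cores * p_single / (n_cores * one_minus_alpha + alpha)

def first_order(n_cores, seq_fixed=1e-7):
    # The looping/housekeeping contribution is taken as a constant part of (1 - alpha).
    return payload_performance(n_cores, seq_fixed)

def second_order(n_cores, seq_fixed=1e-7, seq_per_core=1e-13):
    # The housekeeping grows with the number of cores (e.g., addressing each core once),
    # so it gradually dominates and the payload performance eventually decreases.
    return payload_performance(n_cores, seq_fixed + seq_per_core * n_cores)

if __name__ == "__main__":
    for n in (10**3, 10**5, 10**6, 10**7, 10**8):
        print(f"N={n:>9}: first-order={first_order(n):12.3e}  "
              f"second-order={second_order(n):12.3e}")
```

With the assumed values, the first-order curve saturates, while the second-order one reaches a maximum and then falls, which is exactly the inflection behavior discussed in the text.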
As Figure 1 shows, in parallelized operating mode (in addition to computation, there is also communication of data between the processing units) both the software and the hardware contribute to the execution time, i.e., they both must be considered in Amdahl's Law. This is not new, again: see [2]. Figure 1 also shows where there is room to improve computing efficiency. When combining PD properly with sequential scheduling, the non-payload time can be considerably reduced during fine-tuning of the system (see the cases of the performance increases of Sierra and Summit, half a year after their appearance on the TOP500 list). Also, mismatching the total time and the extended measurement time (or not making a proper correction) may lead to completely wrong conclusions [38], as discussed in [36].

Usually, Amdahl's law is expressed as

S^{-1} = (1 − α) + α/N    (1)

where N is the number of parallelized code fragments (or Processing Units (PUs)), α is the ratio of the parallelizable portion to the total, and S is the measurable speedup. From this,

α = N/(N − 1) · (S − 1)/S    (2)

When calculating the speedup, one actually calculates

S = ((1 − α) + α) / ((1 − α) + α/N) = N / (N·(1 − α) + α)    (3)

hence the resulting efficiency of the system (see Figure 2) is

E(N, α) = S/N = 1 / (N·(1 − α) + α) = R_Max/R_Peak    (4)

This phenomenon itself has been known for decades [3], and α is theoretically established [39]. Recently, however, the theory somewhat faded, mainly due to the quick development of parallelization technology and the increase of single-processor performance.

During the past quarter of a century, the proportions of the contributions changed considerably: today, the number of processors is thousands of times higher than it was a quarter of a century ago. The growing physical size and the higher processing speed increased the role of the propagation overhead; furthermore, the large number of processing units strongly amplified the role of the looping overhead. As a side-result of the technological development, the phenomenon of performance limitation returned, in a technically different form, at a much higher number of processors.
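A short sketch (Python) of Eqs. (1)-(4) is given below; the α and N values are assumed examples, used only to show how the speedup S and the efficiency E = R_Max/R_Peak follow from the formulas.

```python
# Illustrative evaluation of Eqs. (1)-(4); alpha and N below are assumed example values.

def speedup(alpha, n):
    """Eq. (3): S = N / (N*(1-alpha) + alpha)."""
    return n / (n * (1.0 - alpha) + alpha)

def efficiency(alpha, n):
    """Eq. (4): E = S/N = 1 / (N*(1-alpha) + alpha) = R_Max/R_Peak."""
    return speedup(alpha, n) / n

if __name__ == "__main__":
    alpha = 1.0 - 1e-7          # assumed effective parallelization
    for n in (1_000, 100_000, 1_000_000, 10_000_000):
        s = speedup(alpha, n)
        print(f"N={n:>10}  S={s:14.1f}  E={efficiency(alpha, n):8.4f}")
```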
Fig. 2
The two-parameter efficiency surface (as a function of the parallelization efficiency and the number of processing elements), as concluded from Amdahl's Law (see Eq. (4)), in first-order approximation. Some sample efficiency values for selected supercomputers are shown, measured with the benchmarks HPL and HPCG, respectively. "This decay in performance is not a fault of the architecture, but is dictated by the limited parallelism" [3]. (Axes: number of processors, non-payload/payload ratio, efficiency. Systems shown, per TOP500 2020.11: Fugaku, Summit, Sierra, Taihulight, Selene, Jewels, K computer, and an approximate value for the brain.)
By using Eq. (4), E = S/N = R_Max/R_Peak can serve equally well for describing the parallelization efficiency of a setup:

α_{E,N} = (E·N − 1) / (E·(N − 1))    (5)

As we discuss below, except for an extremely high number of processors, it can be safely assumed that the value of α is independent of the number of processors in the system. Eq. (5) can be used to derive the value of α from the values of the parameters R_Max/R_Peak and the number of cores N.

According to Eq. (4), the payload efficiency can be described by a two-dimensional surface, as shown in Figure 2. On the surface, some measured efficiencies of the present top supercomputers are also depicted, just to illustrate some general rules. Both the HPL and the HPCG efficiency values are displayed. The measured values are projected back to the axes, to enable us to compare the corresponding values of the processor counts and the parallelization efficiencies.

The measured values can be separated into two groups. The recent trend is that only a small fraction of the cores is used in the HPCG benchmarking, while of course all cores are used in the HPL benchmarking. As discussed above, the efficiency of course depends on the workload, so the two benchmarked efficiency values can be projected to different numbers of cores and different non-payload to payload ratios on the axes. The other group comprises measurements where the same number of cores was used in both benchmarks. For this group, for visibility, only the HPL projections are displayed.

In this latter group, it can be seen immediately that the efficiency sharply decreases with the number of cores in the system. In the former group, only about 10% of the total cores is used, and the two efficiency values differ by about an order of magnitude. The general experience is that the ratio of the HPL to the HPCG efficiency is about 200-500 when using the same number of cores. This is why these entries reduced their number of cores in the second benchmark: the payload performance reached its "roofline" [40,55] level at that number of cores; using all cores would decrease the system's performance by an order of magnitude, only because of the higher number of cores.

There is an inflexion point in the performance: "there comes a point when using more PUs . . . actually increases the execution time rather than reducing it" [3]. As can be concluded from the figure, increasing the system's nominal performance by an order of magnitude, at the same time, decreases its efficiency (and so its payload performance) by more than an order of magnitude. For the HPCG benchmark, the "roofline" [40] of that non-payload-to-payload intensity is already reached: all computers have about the same efficiency, at the same number of PUs.

It is noticeable that in Fig. 2 the systems having the best efficiency values do not use accelerators: the efficiency of systems using accelerators is much lower also in the case of the HPL benchmark, and it is even more disadvantageous in the case of the HPCG benchmark. As can be seen, the "roofline" efficiency is reached with a lower number of cores. In other words, the accelerators make the systems reach a much worse non-payload to payload ratio.

A direct proof of our statement is the rough estimation that, based on our Eq. (4), if we divide the HPCG efficiency of the supercomputers in the first group by the ratio of their cores used in the benchmark, we arrive at about the same efficiency value for all supercomputers.
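A sketch of this "computational experiment" is given below (Python): Eq. (5) applied to a measured efficiency E = R_Max/R_Peak and core count N. The figures used are placeholders, not values taken from the TOP500 list.

```python
# Deriving Amdahl's parameter from a measured efficiency, Eq. (5).
# E and N below are placeholder examples, not published TOP500 entries.

def alpha_from_efficiency(e, n):
    """Eq. (5): alpha = (E*N - 1) / (E*(N - 1))."""
    return (e * n - 1.0) / (e * (n - 1.0))

def efficiency_from_alpha(alpha, n):
    """Eq. (4), for cross-checking the inversion."""
    return 1.0 / (n * (1.0 - alpha) + alpha)

if __name__ == "__main__":
    e_hpl, n_cores = 0.75, 5_000_000          # assumed HPL-like measurement
    a = alpha_from_efficiency(e_hpl, n_cores)
    print(f"alpha_eff = {a:.12f}, (1-alpha_eff) = {1.0 - a:.3e}")
    print(f"round-trip efficiency check: {efficiency_from_alpha(a, n_cores):.3f}")
```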
Comments such as "The HPCG performance at only 0.3% of peak performance shows the weakness of the Sunway TaihuLight architecture with slow memory and modest interconnect performance" [44] and "The HPCG performance at only 2.5% of peak performance shows the strength of the memory architecture performance" [23] show that supercomputing experts did not realize that the efficiency is a(n at least) two-parameter function, depending on both the number of PUs in the system and its workload. That dependency defines the achievable payload performance. It looks like the community experienced the effect of the two-dimensional efficiency, but does not want to comprehend its reason, despite the early and clear prediction: "this decay in performance is not a fault of the architecture, but is dictated by the limited parallelism" [3].
Fig. 3 Correlation of the performance of processors using accelerators, and of the effective parallelism, with ranking, in 2017. (Left panel: processor performance (Gflop/s) vs. ranking by HPL; right panel: (1 − α_eff^HPL) vs. ranking by HPL; accelerated, non-accelerated, and GPU-accelerated systems, with their regression lines.)

In excessive systems built of modern HW, it is also dictated by the laws of nature [4]. Furthermore, its dependence can be perfectly described by the properly interpreted Amdahl's Law, rather than being an "empirical efficiency".

As suggested by Eq. (6), the trivial way to increase the resulting payload performance of a supercomputer is to increase the single-processor performance of its processors. Given that the single-processor performance has reached its limitations, some kind of accelerator is frequently used for this goal.

Fig. 3 shows how utilizing accelerators influences the ranking of supercomputers. The left subfigure shows that the ranking of a supercomputer does not depend on which method of acceleration it uses. Essentially the same is confirmed by the right subfigure: the effective parallelization rises with the ranking position, and the slope is the same for any kind of acceleration. As the left subfigure depicts, GPU-accelerated processors really increase the payload performance of the system, by a factor of 2-2.5. However, this increased performance is about 40 times lower than the expectation based on the nominal performance of the GPU accelerator. The right subfigure, however, reveals that the effective parallelization (the non-payload to payload ratio) of GPU-accelerated systems is nearly an order of magnitude worse than that of the non-accelerated systems, i.e., the resulting efficiency can be (depending on the size of the system) worse than in the case of utilizing unaccelerated processors; this can be a definite disadvantage when GPUs are used in a system with an extremely large number of processors.

Noticeably, the two new items in Fig. 2 (Selene and Jewels, based entirely on GPUs) not only show the worst efficacy for the HPL benchmark, but their HPCG efficiencies push back the number of usable cores (for real-life tasks) well below the hundred-thousand limit. On one side, the achieved HPCG efficiency values show the same tendency as the HPL efficiency values: the more cores, the lower the efficiency. On the other side, the figure shows that for real-life tasks the reasonable size of supercomputers is below 1M cores for non-accelerated cores, and well below 0.1M cores for accelerated ones. The efficiency is just a few percent, and increasing the number of cores decreases the efficiency (it does NOT increase the payload performance).

The reason is that GPUs and PUs have separate address spaces, and data must be moved from one address space to the other; this needs time and increases the latency of the processor. Turning the memory into a (partly) active element, using different 'coherence' solutions such as OpenCAPI, can mitigate this effect. See also section 5.

In the HPL class, the communication intensity is the lowest possible: the computing units receive their task (and parameters) at the beginning of the computation, and they return their result at the very end. That is, the core orchestrating their work must deal with the fellow cores only in these periods, so the communication intensity is proportional to the number of cores in the system.
Notice the need to queue requests at the beginning and at the end of the task.

In the HPCG class, iteration takes place: the cores return the result of one iteration to the coordinator core, which performs sequential operations: it not only receives and re-sends parameters, but also needs to compute new parameters before sending them to the fellow cores. The program repeats the process several times. As a consequence, the non-parallelizable fraction of the benchmarking time grows proportionally to the number of iterations and the size of the problem. The effect of that extra communication decreases the achievable performance roofline [40]: as shown in Fig. 4, the HPCG roofline is about 200 times lower than the HPL one, as discussed in section 3. It is remarkable that the performance gain roofline for HPL and the one for HPCG differ only marginally. Their difference can be attributed to the extra non-payload contribution of the benchmark HPCG. Notice that the performance gain values for HPCG are measured using only a fraction of the available cores; the real performance gain should be an order of magnitude lower. With that correction, a sharp breakdown of the performance gain can be observed, as theoretically predicted [55].

Only three of the ten top supercomputers used all of their cores when running HPCG, while, of course, they all used all their cores when running the benchmark HPL.
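The contrast between the two workload classes can be sketched as below (Python; all counts and per-message times are assumed placeholders): in the HPL-like case the orchestrating core talks to each fellow core once at the start and once at the end, while in the HPCG-like case this sequential work repeats in every iteration, so the non-parallelizable fraction grows with the iteration count.

```python
# Toy comparison of the sequential (non-parallelizable) benchmarking-time fraction
# for an HPL-like and an HPCG-like workload. All numbers are assumptions for illustration.

def sequential_fraction(n_cores, iterations, t_per_message, t_total):
    """Orchestrating core talks to every fellow core at start and end of each iteration."""
    sequential_time = 2 * n_cores * t_per_message * iterations
    return sequential_time / t_total

if __name__ == "__main__":
    n_cores = 1_000_000
    t_msg = 5e-7            # assumed ~500 ns per message, as mentioned in the text
    t_total = 10_000.0      # assumed total benchmark time in seconds

    hpl_like = sequential_fraction(n_cores, iterations=1,
                                   t_per_message=t_msg, t_total=t_total)
    hpcg_like = sequential_fraction(n_cores, iterations=500,
                                    t_per_message=t_msg, t_total=t_total)

    print(f"(1-alpha) HPL-like : {hpl_like:.2e}")
    print(f"(1-alpha) HPCG-like: {hpcg_like:.2e}  "
          f"({hpcg_like / hpl_like:.0f}x larger)")
```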
Comments such as "The HPCG performance at only 0.3% of peak performance shows the weakness of the Sunway TaihuLight architecture with slow memory and modest interconnect performance" [44] and "The HPCG performance at only 2.5% of peak performance shows the strength of the memory architecture performance" [23] show that supercomputing experts did not realize that the efficiency is a(n at least) two-parameter function of the number of PUs in the system, and that the workload defines the achievable payload performance.

Notice two more rooflines. During the development of the interconnection technology, between the years 2004 and 2012, different implementation ideas were around, and they competed for years.
Fig. 4
The performance gain of top supercomputers, as a function of their year of production. The marks display the measured values derived using the HPL and HPCG benchmarks, for the TOP3 supercomputers. The small black dots mark the performance data of the supercomputers
JUQUEEN and K, as of June 2014, for the HPL and HPCG benchmarks, respectively. The big black dot denotes the performance value of the system used by [26]. The red pentagons denote the performance gain measured using half-precision operands. The saturation effect can be observed for both the HPL and HPCG benchmarks.

The beginning of the second roofline, around the year 2011, coincides with the dawn of GPUs, interfering with the effect of the interconnection technology. The top roofline dawned with the appearance of Taihulight: some assistant processors take over part of the duties of the individual cores, and in this way the non-payload portion of the workload can be mitigated.

It is important to notice the two red pentagons in the figure: they represent the performance gain achieved using half-precision operand length. The performance gain is lower than the double-precision equivalent, just because of the increased relative weight of housekeeping, as discussed in detail in section 4.4.

The projections of the efficiency values onto the axes show that the top few supercomputers exhibit very similar parallelization efficiency and core number values: both are required to receive one of the top slots; see also Figure ??. The supercomputers Taihulight and
Fugaku are exceptions on both axes. They have the highest number of cores and the best HPL parallelization efficiency. An interesting coincidence is that the processors of both supercomputers have "assistant cores" (i.e., some cores do not do payload computing; instead, they take over "housekeeping duties"). This solution decreases the internal latency of the processors doing payload computing and increases the efficiency of the system. They both use a "light-weight operating system" (and so do Fugaku and Sierra; four out of the first four), also to reduce processor latency. This efficiency, of course, requires executing several floating-point instructions per clock cycle. That mode of operation gets more and more challenging for the interconnection, delivering data to and from the data processing units. Notice also, in their cases, the role of "near" memories: as explained in [19], the data delivery time considerably increases the "idle time" of computing. This idle time is why Fugaku, with its cleverly placed L2 cache memories, can be more effective when measured with HPL. This trick, however, does not work in the case of HPCG, because its "sparse" computations use those cache memories ineffectively. The "true" HPCG efficiency of Fugaku is expected to be between the corresponding values of Summit and Taihulight.

In addition, the processors of Taihulight comprise cooperating cores [37]. The direct core-to-core transfer uses a (slightly) different computing paradigm: the processor cores explicitly assume the presence of another core, and in this way their effective parallelism becomes much better; see also Fig. 6. In that figure, this data point and the ones using shorter operands (Summit and Fugaku) result in effective parallelization values below the limiting line. Reducing the loop count by internal clustering (in addition to the "hidden clustering" enabled by its assistant cores) and exchanging data without using the global memory, however, works only for the HPL case, where the contribution of the SW is low. The poor value of (1 − α_eff^HPCG) is not necessarily a sign of architectural weakness [9]: Taihulight comprises about four times more cores than Summit and performs the HPCG benchmark with ten-fold more cores. Given that HPCG mimics "real-life" applications, one can conclude that for practical purposes only systems comprising a few hundred thousand cores shall be built. More cores contribute only to the "dark performance".

According to Eq. (4), the efficiency can be interpreted in terms of α and N, and the payload performance of a parallelized sequential computing system can be calculated as

P(N, α) = N·P_single / (N·(1 − α) + α)    (6)

This simple formula explains why the payload performance is not a linear function of the nominal performance, and why, in the case of very good parallelization ((1 − α) ≪ 1) and low N, this nonlinearity cannot be noticed.

The value of α, however, can hardly be calculated for the present complex HW/SW systems from their technical data (for a detailed discussion see [55]). Two ways can be followed to estimate the value of α. One way is to calculate α for existing supercomputing systems (making "computational experiments" [18]), applying Eq. (5) to data in the TOP500 list [4]. This way provides a lower bound for (1 − α), which is already achieved. The other way around is to consider contributions of different origin, see section 4, and to calculate the upper limit of the value of (1 − α) that the given contribution alone does not enable to exceed (provided that that contribution is the dominant one). It gives us good confidence in the reliability of the parameters that the values derived in these two ways differ only within a factor of two. At the same time, this also means that the technology is already very close to its theoretical limitations.

(Assuming good interconnection. For computing systems connected with general-purpose networks, such as some high-performance clouds, this limiting number is much lower.)

Notice that the "algorithmic effects" – such as dealing with "sparse" data structures (which affects cache behavior, something that will have a growing importance in the age of "real-time everything" and neural networks) or communication between parallelly running threads, such as returning results repeatedly to the main thread in an iteration (which greatly increases the non-parallelizable fraction in the main thread) – manifest through the HW/SW architecture, and they can hardly be separated. Also notice that there are one-time and fixed-size contributions, such as utilizing time-measurement facilities or calling system services. Since α_eff is a relative merit, the absolute measurement time shall be large. When utilizing efficiency data from measurements that were dedicated to some other goal, proper caution must be exercised with the interpretation and accuracy of those data.

The 'right efficiency metric' [42] has always been a question (for a summary see the references cited in [43]) when defining efficient supercomputing. The goal of the present discussion is to find out the inherent limitations of parallelized sequential computing, and to provide numerical values for them. For this goal, the 'classic interpretation' [2,3,39] of performance was used, in its original spirit. The contributions mentioned in those papers were scrutinized, and their importance under current technical conditions was revised.

The left subfigure of Figure ?? shows that, to get a better ranking on the TOP500 list, a higher number of processors is required. The regression line is different for the TOP10 and the TOP50 positions. The cut line between "racing" and "commodity" supercomputers is around slot 10. As the right subfigure underpins, a high number of processors must be accompanied by a good parallelization efficiency; otherwise, the large number of cores cannot counterbalance the decreasing efficiency, see Eq. (6).

Theory can display data from systems with any contributors and any parameters, but from measured data only the sum of all contributions can be concluded, although dedicated measurements can reveal the values of separated contributions experimentally. The publicly available data enable us to draw only conclusions of limited validity.

The estimations below assume that the actual contribution is the dominating one and, as such, defines the achievable performance alone.
This situation is usually not the case in practice, but this approach enables us to find out the limiting (1 − α) values for all contributions.

In the systems implemented in the Single Processor Approach (SPA) [2] as parallelized sequential systems, the life of a task begins in one such sequential subsystem; see also Fig. 1. In large parallelized applications, running on general-purpose supercomputers, initially and finally only one thread exists, i.e., the minimal necessary non-parallelizable activity is to fork the other threads and join them again.

With the present technology, no such action can be shorter than one processor clock period. That is, the theoretical absolute minimum value of the non-parallelizable portion of the task is given as the ratio of the time of these two clock periods to the total execution time. The latter time is a free parameter in describing the efficiency. That is, the value of the effective parallelization α_eff depends on the total benchmarking time (and so does the achievable parallelization gain, too).

This dependence is, of course, well known to supercomputer scientists: for measuring the efficiency with better accuracy (and also for producing better α_eff values), hours of execution time are used in practice. In the case of benchmarking the supercomputer Taihulight [44], an HPL benchmark runtime of 13,298 seconds was used; on the 1.45 GHz processors, this means about 2·10^13 clock periods. The inherent limit of (1 − α_eff) at such a benchmarking time is 10^-13 (or, equivalently, the achievable performance gain is 10^13). In the paper, for simplicity, 1.00 GHz processors (i.e., 1 ns clock cycle time) will be assumed.

Supercomputers are also distributed systems. In a stadium-sized supercomputer, a distance between processors (cable length) of up to about 100 m can be assumed. The net signal round-trip time is ca. 10^-6 seconds, or about 10^3 clock periods; i.e., in the case of a finite-sized supercomputer, the performance gain cannot exceed the corresponding limit, only because of the physical size of the supercomputer. The presently available network interfaces have 100...200 ns latency times, and sending a message between processors takes a time of the same order of magnitude, typically 500 ns. This timing also means that making a better interconnection is no longer the bottleneck in enhancing performance. This statement is also underpinned by the discussion in section 4.3.

These predictions enable us to assume that the presently achieved value of (1 − α_eff) could also persist for roughly a hundred times more cores. However, another major issue arises from the computing principle of the SPA: only one processor at a time can be addressed by the first one. As a consequence, at least as many clock cycles must be used for organizing the parallelized work as many addressing steps are required. This number equals the number of cores in the supercomputer, i.e., the addressing operation in supercomputers in the TOP10 positions typically needs clock cycles in the order of 10^6...10^7, degrading the value of (1 − α_eff) correspondingly.

(This statement is valid even if some parallelly working units can execute more than one instruction in a clock period. One can take these two clock periods as an ideal (but not realistic) case. However, the actual limitation will inevitably be (much) worse than the one calculated for this idealistic case. The exact number of clock periods depends on many factors, as discussed below.)
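A back-of-the-envelope version of these estimates is sketched below (Python). The benchmark duration and clock frequency follow the Taihulight HPL run quoted above; the cable length and signal speed are assumptions used only for illustration.

```python
# Rough limits of (1 - alpha_eff) following the reasoning in the text.
# Benchmark time and clock follow the quoted Taihulight HPL run;
# cable length and signal speed are assumptions for illustration.

def one_minus_alpha_limit(sequential_clocks, total_time_s, clock_hz):
    """Sequential clock periods divided by the total number of clock periods."""
    total_clocks = total_time_s * clock_hz
    return sequential_clocks / total_clocks

if __name__ == "__main__":
    benchmark_time = 13_298.0        # s, HPL runtime quoted in the text
    clock = 1.45e9                   # Hz

    # Idealistic case: two clock periods (fork + join) of sequential work.
    ideal = one_minus_alpha_limit(2, benchmark_time, clock)
    print(f"idealistic (1-alpha) limit: {ideal:.1e}  -> max gain ~ {1/ideal:.1e}")

    # Physical size: assumed 100 m cable, signal speed ~2e8 m/s, round trip.
    round_trip_s = 2 * 100.0 / 2e8
    size_clocks = round_trip_s * clock
    size_limit = one_minus_alpha_limit(size_clocks, benchmark_time, clock)
    print(f"round trip: {round_trip_s:.1e} s (~{size_clocks:.0f} clocks), "
          f"(1-alpha) contribution: {size_limit:.1e}")
```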
Two tricks may be used to mitigate the number of the addressing steps: either the cores are organized into clusters, as many supercomputer builders do, or, at the other end, the processor itself can take over the responsibility of addressing its cores [37]. Depending on the actual construction, clustering of these kinds can reduce the number of addressing steps by orders of magnitude, and the resulting value of (1 − α_eff) improves correspondingly.

An operating system must also be used, for protection and convenience. If fork/join is executed by the OS, as usual, then because of the needed context switchings about 2·10^4 [45,46] clock cycles are used, rather than the 2 clock cycles considered in the idealistic case. The derived values are correspondingly four orders of magnitude worse; that is, the absolute limit is ≈ 10^-9, on a zero-sized supercomputer. This value is somewhat better than the limiting value derived above, but it is close to that value and surely represents a considerable contribution. This limitation is why a growing number of supercomputers uses a "light-weight kernel", or runs its actual computations in kernel mode; a method of computing that can be used only with well-known benchmarks.

However, this optimistic limit assumes that an instruction can be accessed in one clock cycle. This is usually not the case, but it seems to be a good approximation. On one side, even a cached instruction in memory needs about five times more access time, and the time required to access 'far' memory is roughly 100 times longer. Correspondingly, the most optimistic achievable performance gain values shall be scaled down by a factor of 5...100. The difference between α_eff^HPL and α_eff^HPCG can be attributed to a different cache behavior, because of the 'sparse' matrix operations.

4.2 The effect of workflow

The overly complex Figure 5 attempts to explain the phenomenon of why and how the performance of a supercomputer configuration depends on the application it runs. The non-parallelizable fraction (denoted in the figure by α_eff^X) of the computing task comprises components X of different origin. As already discussed, and as was noticed decades ago, "the inherent communication-to-computation ratio in a parallel application is one of the important determinants of its performance on any architecture" [3], suggesting that communication can be a dominant contribution to a system's performance. Figure 5.A displays a case with minimum communication, and Figure 5.B a case with moderately increased communication (corresponding to real-life supercomputer tasks). As the nominal performance increases linearly and the payload performance decreases inversely with the number of cores, at some critical value, where an inflection point occurs, the resulting payload performance starts to drop.

(Fig. 5 panels: A – Input Layer / "HPL Layer" / Output Layer; B – Input Layer / "HPCG Layer 1" ... "HPCG Layer n" / Output Layer; C – Input Layer / Hidden Layer 1 ... Hidden Layer n / Output Layer; in each case the right-hand diagram shows R_Max (Eflop/s) and the (1 − α_eff^X) contributions (α_SW, α_OS, α_eff) vs. R_Peak (Eflop/s).)

Fig. 5
The figure explains how the different communication/computation intensities of applications lead to different payload performance values in the same supercomputer system. Left column: models of the computing intensities for different benchmarks. Right column: the corresponding payload performances and α contributions as a function of the nominal performance of a fictive supercomputer (P = 1 Gflop/s @ 1 GHz). The blue diagram line refers to the right-hand scale (R_Max values), all others ((1 − α_eff^X) contributions) to the left-hand scale. The figure purely illustrates the concepts; the displayed numbers are only roughly similar to real ones. The performance breakdowns shown in the figures were experimentally measured by [3], [49] (Figure 7) and [26] (Figure 8).

The resulting non-parallelizable fraction sharply decreases the efficacy (or, in other words, the performance gain or speedup) of the system [47,48]. The effect was noticed early [3], under different technical conditions, but somewhat faded due to the successes of the development of parallelization technology.

Figure 5.A illustrates the behavior measured with the HPL benchmark. The looping contribution becomes remarkable around 0.1 Eflops, and breaks down the payload performance (see also Figure 1 in [3]) when approaching 1 Eflops. In Figure 5.B, the behavior measured with the benchmark HPCG is displayed. In this case, the contribution of the application (brown line) is much higher than in the previous case. The looping contribution (thin green line) is the same as above. Consequently, the achievable payload performance is lower, and the breakdown of the payload performance is also softer in the case of real-life tasks.

Given that no dedicated measurements exist, it is hard to make a direct comparison between theoretical prediction and measured data. However, the impressive and quick development of interconnecting technologies provides a helping hand.

4.3 Contribution of the interconnection

As discussed above, in a somewhat simplified view, the resulting performance can be calculated using the contributions to α as

P(N, α) ≈ N·P_single / (N·((1 − α_Net) + (1 − α_Compute) + (1 − α_Others)) + α)    (7)

The difference of the measured α values can be directly compared to the difference of the corresponding sum of the calculated α values, although here only qualitative agreement can be expected.

Both the quality of the interconnection and the nominal performance are parametric functions of their time of construction, so one can assume on the theory side that (in a limited period) the interconnection contribution changed as a function of the nominal performance as shown in Figure 6A. The other major contribution is assumed to be the computation itself. The benchmark computation contributions for HPL and HPCG are very different, so the sums of the respective component plus the interconnection component are also very different. Given that at the beginning of the considered period the contribution from the HPCG calculation and that of the interconnection were of the same order of magnitude, their sum changed only marginally (see the upper diagram lines), i.e., the measured performance improved only slightly. Because the benchmark HPCG is communication-bound (and so are real-life programs), their efficiency would be an order of magnitude worse. The reason is Eq. (4): when supercomputers use all of their cores, their achievable performance is not higher (or maybe even lower), only their power consumption is higher (and the calculated efficiency is lower). As predicted: "scaling thus put larger machines at an inherent disadvantage" [3]. The cloud-like supercomputers have a disadvantage in the HPCG competition: the Ethernet-like operation results in relatively high (1 − α) values.
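The switch of the dominating term can be illustrated with the toy calculation below (Python); the decay rate of the interconnection contribution and the constant computation contribution are assumed values, chosen only to reproduce the qualitative behavior discussed here: once the interconnection term drops below the computation term, the total no longer improves noticeably.

```python
# Toy model of the sum of (1-alpha) contributions while the interconnection improves.
# All parameter values are assumptions; only the qualitative behavior matters.

def total_sequential_fraction(year, net0=1e-5, improvement_per_year=2.0, compute=2e-7):
    """Interconnection contribution shrinks by a constant factor each year;
    the benchmark's own computation contribution stays constant."""
    net = net0 / (improvement_per_year ** year)
    return net + compute, net

if __name__ == "__main__":
    compute = 2e-7
    for year in range(0, 11):
        total, net = total_sequential_fraction(year, compute=compute)
        dominant = "interconnect" if net > compute else "computation"
        print(f"year {year:2d}: net={net:.1e}  total={total:.1e}  dominant: {dominant}")
```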
(This time also includes accessing the data.)

(Fig. 6 panels: A – (1 − α), "α_total and α_interconnection, theory", showing the α_Interconnect, α_HPL, α_HPCG, α_Interconnect+HPL and α_Interconnect+HPCG curves vs. R_Peak (Eflop/s); B – the corresponding measured data: HPL for June 2009-2019, HPL-AI for June 2020, HPCG for June 2017 and 2020.)

Fig. 6
The effect of changing the dominating contribution. The left subfigure shows the theoretical estimation, the right subfigure the corresponding measured data, as derived from the public database TOP500 [50] (only values for the first four supercomputers are shown). When the contribution from the interconnection drops below that of the computation, the value of (1 − α) (and the performance gain) gets saturated. The red 'x' marks denote half-precision values.

The case of the HPL calculation is drastically different (see the lower diagram lines). Since in this case, at the beginning of the considered period, the contribution from the interconnection was very much larger than that from the computation, the sum of these two contributions changes sensitively as the speed of the interconnection improves. As soon as the contribution from the interconnection decreases to a value comparable with that from the computation, the decrease of the sum slows down considerably, and further improvement of the interconnection causes only a marginal decrease in the value of the resulting α (and so only a marginal increase in the payload performance).

The measured data enable us to draw the same conclusion, but one must consider that here multiple parameters may have been changed. Their tendency, however, is surprisingly clear. Figure 6.B is actually a 2.5D diagram: the size of the marks is proportional to the time passed since the beginning of the considered period. A decade ago, the speed of the interconnection gave the major contribution to α_total. Enhancing it drastically in the past few years increased the efficacy. At the same time, because of the stalled single-processor performance, the other technology components changed only marginally. The computational contribution to α from the benchmark HPL remained constant as a function of time, so the quick improvement of the interconnection technology resulted in a quick decrease of α_total, and the relative weights of α_Net and α_Compute reversed.
The decrease in the value of (1 − α) can be considered as the result of the decreased contribution from the interconnection. However, the total α contribution decreased considerably only until α_Net reached the order of magnitude of α_Compute. This match occurred in the first 4-5 years of the period shown in Figure 6B: the sloping line is due to the enhancement of the interconnection. Then the two contributors changed their roles, and the constant contribution due to the computation started to dominate, i.e., the total α contribution decreased only marginally. As soon as the computing contribution took over the dominating role, the value of α_total did not fall any more: all measured data remained above that value of α. Correspondingly, the payload performance improved only marginally (and due to factors other than the interconnection).

At this point, as a consequence of the change of the dominating contributor, it was noticed that the efficacy of the benchmark HPL and the efficacy of real-life applications started to differ by up to two orders of magnitude. Because of this, a new benchmark program, HPCG [51], was introduced, since "HPCG is designed to exercise computational and data access patterns that more closely match a different and broad set of important applications" [52].

Since the major contributor is the computing itself, the different benchmarks contribute differently, and since that time "supercomputers have two different efficiencies" [14]. Yes: if the dominating α contribution (from the benchmark calculation) is different, then the same computer shows different efficiencies as a function of the computation it runs. Since that time, the interconnection provides a smaller contribution than the computation of the benchmark. Due to that change, enhancing the interconnection contributes mainly to the dark performance, rather than to the payload performance.

4.4 The effect of reduced operand length

The so-called HPL-AI benchmark used Mixed Precision rather than Double Precision computations. The name suggests that AI applications may run on the supercomputer with that efficiency. However, the type of workload does not change, so one can expect the same overall behavior for AI applications, including Artificial Neural Networks (ANNs), as for double-precision operands. For AI applications, the limitations remain the same as described above, except that when using Mixed Precision, the efficiency can be better by a factor of
2-3, strongly depending on the workload of the system; see also Fig. 8.

Unfortunately, when using half-precision, the enhancement comes from accessing less data in memory and using faster operations on the shorter operands, instead of reducing the communication intensity that defines the efficiency. Similarly, exchanging data directly between processing units [37] (without using the global and even the local memory) also enhances α (and the payload performance) [54], but it represents a (slightly) different computing paradigm. Only the two mentioned measured data points fall below the limiting line of (1 − α) in Figure 6.

Recent supercomputers Fugaku [23] and
Summit [53] provided their HPLperformance for both 64-bit and 16-bit operands. Of course, their performance Both names are rather inconsequent. On one side, the test itself has not much to dowith AI, it just uses the operand length common in Artificial Intelligence (AI) tasks; (HPL,similarly to AI, is a workload type ). On the other side, Mixed Precision is actually HalfPrecision: for multiplication twice as long operands are used temporarily. It is a differentquestion is if those operations are contracted. On the contrary, the relative weight of communication is increased in this way.he many efficiencies of supercomputers 21 seems to be much better with shorter operand length (at the same numberof operations, the total measurement time is much shorter). It was expectedthat their performance should be four times higher when using four timesshorter operands. Power consumption data [53] underpin the expectations:power consumption is about four times lower for four times shorter operands.Computing performance, however, shows a slighter performance enhancement:3.01 for
Summit and 3.42 for
F ugaku , because of the needed housekeeping.In the long run, a
In the long run, a Time_X value comprises housekeeping and computation. We assume that the housekeeping (indexing, branching) takes the same fixed amount of time for different operand lengths, and that the other time contribution (data delivery and bit manipulation) is proportional to the operand length. Given that, according to our model, the measured payload performance directly reflects the sum of all contributions, we can assume that

Time_16 = F_house + F_16
Time_64 = F_house + 4·F_16        (8)

where F_house is the time contribution from housekeeping (in a long-term run, using benchmark HPL), and F_16 is the time contribution due to manipulating 16 bits.

We can use two simple models for calculating the relative values of F_house and F_16. Both models are speculative rather than technically established, but they nicely point to the key issue: the role of housekeeping.

If no part of the housekeeping is parallelized with the floating computation and, furthermore, the computing and data transfer operations do not block each other, the simple summing above can be used, see Fig. 7. With a proper parallelization method, the non-floating housekeeping for the next operand could be performed in parallel with the floating operation on the current operand, so the theoretically possible ratio of four could be reached. Notice, however, that this model is not aware of the data transfer time.

When the temporal behavior of the components is considered, the trigonometric (orthogonal) sum of the non-payload and payload times shall be used, see [19]:

Time_16 = √(F_house² + F_16²)
Time_64 = √(F_house² + (4·F_16)²)        (9)

Table 1 shows the data calculated from the values published in the TOP500 list for the supercomputers Fugaku and Summit, together with their parameters calculated as described above, using the time-aware summing model. In the discussed simple workload case, there is no significant difference between the values derived using the two models. The values in the figures are given in units of payload performance ratios; the values in the table are given in units of the respective α contributions; the numbers can only be compared as discussed below. The Eff_64 and Eff_16 values are calculated from the corresponding published R_Max/R_Peak values; Amdahl's parameter is calculated using Eq. (5), for the two different operand lengths.
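The derivation can also be followed numerically. The sketch below is illustrative only: it assumes that Eq. (5) has the Karp–Flatt-like form (1−α) = (1/E − 1)/(N−1), that the total measurement time is normalized to unity, and that the core count used here is a placeholder rather than a published value; the efficiencies and the performance ratio are the ones quoted above for Fugaku.

# Illustrative sketch (not the author's code): deriving (1-alpha) and the
# Time values of Section 4.4 from published efficiency data.

def one_minus_alpha(efficiency, n_cores):
    # Assumed Karp-Flatt-like form of Eq. (5): (1-alpha) = (1/E - 1)/(N - 1)
    return (1.0 / efficiency - 1.0) / (n_cores - 1.0)

def time_values(eff_64, eff_16, n_cores, perf_ratio_16_vs_64):
    # Time_64 is taken directly as (1-alpha_64); Time_16 is the measured
    # (1-alpha_16) corrected for the shorter FP16 measurement time.
    time_64 = one_minus_alpha(eff_64, n_cores)
    time_16 = one_minus_alpha(eff_16, n_cores) / perf_ratio_16_vs_64
    return time_64, time_16

# Placeholder core count; substitute the published value of the machine analyzed.
print(time_values(eff_64=0.808, eff_16=0.691,
                  n_cores=7_000_000, perf_ratio_16_vs_64=3.42))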
Fig. 7 Left: the serial summing of the payload and non-payload contributions (assumes no parallelization and/or no blocking). Right: the parallel summing of the payload and non-payload contributions (assumes parallelization and enables considering mutual blocking).
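The two speculative summing models can also be inverted: given the Time_64 and Time_16 values, F_house and F_16 follow from Eq. (8) or Eq. (9). The sketch below does this under the stated assumptions, purely as an illustration of the role of housekeeping.

import math

def split_serial(time_64, time_16):
    # Serial model, Eq. (8): Time_16 = F_house + F_16, Time_64 = F_house + 4*F_16
    f_16 = (time_64 - time_16) / 3.0
    return time_16 - f_16, f_16

def split_time_aware(time_64, time_16):
    # Time-aware model, Eq. (9): Time_16^2 = F_house^2 + F_16^2,
    #                            Time_64^2 = F_house^2 + (4*F_16)^2
    f_16 = math.sqrt((time_64**2 - time_16**2) / 15.0)
    return math.sqrt(time_16**2 - f_16**2), f_16

With the Time values of Table 1, the two models yield F_house and F_16 estimates of the same order, in line with the statement that for this simple workload the two models do not differ significantly.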
Table 1 Floating point characteristics of the supercomputers Fugaku and Summit

Name     Eff_64   Eff_16   (1−α_64)   (1−α_16)   Time_64    Time_16
Fugaku   0.808    0.691    3.25e-8    6.12e-8    3.25e-8    1.79e-8
Summit   0.74     0.557    14.7e-8    33.2e-8    14.7e-8    11.0e-8

As discussed, (1−α_64) is the sequential portion of the total measurement time (aka the non-payload to payload ratio). Assuming that the total measurement time is the time unit, the limiting time of performing a floating operation with 64-bit operands, Time_64 (on our arbitrary time scale), is directly derived from the (1−α_64) value. To get a Time_16 value on the same scale, the measured (1−α_16) value must be corrected for the differing measurement time (by the measured performance ratios of 3.42 and 3.01, respectively).

The absolute values of the data in the last two columns of the table shall not be compared directly with those of the other supercomputer (their measurement times were different). Given that the task and the computing model were the same, however, we can directly compare the ratios of the F_house and F_16 values derived from them. The proportions of both the F_house values and the F_16/F_house values show that the housekeeping is much better for Fugaku than for
Summit. Given that their architectures are globally similar, the plausible reason for the difference in their efficacy (and performance) is that in the case of Summit the processor core plays the role of a proxy (and in this way represents a bottleneck), while Fugaku uses "assistant cores". Housekeeping increases latency and significantly decreases the performance of the system. The plausible reason for their better F_16 values is the clever use (and positioning, see [19]) of the L2-level cache memories.

In the case of Summit, we also know its HPCG efficiency. In its case, the Time_HPCG value is about 2·10⁻⁵, i.e., several hundred times higher than Time_HPL. Given that the floating-operand contributions are the same in the case of the two benchmarks, the difference is caused by F_house. In the long run, the different workload (iteration, more intensive communication, "sparse" computation forcing different cache utilization) forces a different F_house value, and that leads to the "different efficiencies" [14] of supercomputers under different workloads.

In the case of Fugaku, only a fraction of its cores was used in the benchmark HPCG, so only the achieved payload performance (but not its efficiency) can be validated. It is very plausible that its HPCG performance reached its roofline and that (because of the higher number of cores) its real HPCG efficiency would be around that of Taihulight. Anyhow, it would not be fair either to assume that speculation or to accept a value measured with a different number of cores; there are no measured data.

These data directly underpin that the technology is (almost) perfect: the contribution from the benchmark calculation
HPCG-FP64 is by orders of magnitude larger than the contribution from all the rest. Recalling that that benchmark program imitates the behavior (as defined by the resulting α) of real-life programs, one can see that the contribution from the other computing-related actors is about a thousand times smaller than the contribution of the computation+communication.

The unique role of the "mixed precision" efficiencies (a third kind of efficiency of a supercomputer), see the red 'x' marks in Fig. 6, deserves special attention. Strictly speaking, the points cannot be correctly positioned in the figure; they belong to a different scale (they are measured on different HW). On one side, the same number of operations is performed, using the same number of PUs. On the other side, four times less data are transferred and manipulated. The nominal performance is expected to be four times higher than in the case of using double-precision operands. Without correcting for the more than three times shorter measurement (see below), the efficiency mark is slightly above the corresponding value measured with double-length operands (the relative weight of the housekeeping is higher); with the correction, it is somewhat below it. This difference, however, is noticeable only with benchmark HPL; in the case of the HPCG workload, the computation (including the operand length) has a marginal effect. In the former case, the contributions of computation and communication are in the same range of magnitude, and they compete for dominating the performance of the system. In the latter case, communication dominates the performance; computing has a marginal role. This is the reason why no data are available for the HPCG benchmark using half-precision operands.

Fig. 8 illustrates the role of the non-payload contribution with respect to the operand length. In the figure (for visibility), a hypothetical ratio of 10 between the efficiencies measured by the benchmarks HPL and HPCG was assumed. The non-payload contribution blocks the operation of the floating point unit. The dominating role of the non-payload contribution also means that it is of little importance whether double- or half-precision operands are used in the computation. The blue vectors essentially represent the case of Summit; the red vector represents the corresponding value of Fugaku (transformed to the scaling of Summit). The difference between their HPL performances is attributed to their different F_house values (recall that cache behavior may be included here).
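The marginal role of the operand length under a housekeeping-dominated workload can be illustrated with a few lines of arithmetic. The sketch below uses the time-aware (orthogonal) summing of Eq. (9) with hypothetical contribution values; it illustrates the tendency shown in Fig. 8 and is not a reproduction of the measured data.

import math

def fp16_speedup(f_nonpayload, f_16):
    # Ratio of the FP64 and FP16 execution times under orthogonal summing:
    # approaches 4 when the payload dominates, approaches 1 when the
    # non-payload (housekeeping, communication) contribution dominates.
    time_64 = math.sqrt(f_nonpayload**2 + (4.0 * f_16)**2)
    time_16 = math.sqrt(f_nonpayload**2 + f_16**2)
    return time_64 / time_16

print(fp16_speedup(f_nonpayload=1.0, f_16=1.0))    # HPL-like case: ~2.9
print(fp16_speedup(f_nonpayload=100.0, f_16=1.0))  # HPCG-like case: ~1.001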
Fig. 8 The role of the non-payload contribution in defining the HPL and HPCG efficiencies, for double- and half-precision floating operation modes (timing of the Summit and Fugaku implementations). For visibility, a hypothetical efficiency ratio E_HPL/E_HPCG = 10 is assumed. The housekeeping (including transfer time and cache misses) dominates; the length of the operands has only marginal importance.
This difference, however, gets marginal as the workload approaches real-life conditions.

4.5 Further efficiency values

The performance corresponding to α_HPL^FP0 is slightly above 1 EFlops (when making no floating operations, i.e., rather Eops). Another peak performance, reported when running genomics code on Summit (using a mixture of numerical precisions and mostly non-floating point instructions), is 1.88 Eops, corresponding to a comparable α^Genom value. Given that those two values refer to a different mixture of instructions, the agreement is more than satisfactory.

Our simple calculations show that, in the case of benchmark HPL, the housekeeping contributions are of the same order as the floating-point contributions, and that benchmark HPL is computing bound: reducing the housekeeping (including communication) makes some sense for "racing supercomputers", while for real-life applications it has only a marginal effect. On the other side, increasing the housekeeping (more communication, such as in the case of benchmark HPCG or ANNs) degrades the apparent performance. At a sufficiently large amount of communication [55], the housekeeping dominates the performance, and the contribution of the floating-point operations (of whatever operand length) becomes marginal. For large ANN applications, using shorter operands makes no real difference: their workload defines their performance (and efficiency). The "commodity supercomputers" can achieve the same payload performance although they have a much lower number of PUs: the "racing supercomputers" either cannot use their impressively vast number of cores at all, or using all their cores does not increase their payload performance in solving real-life tasks.

[…] times worse (1−α_eff) and performance gain values. The dominant limiting factor, however, is a different one.

In brain simulation, a 1 ms integration time (essentially a sampling time) is commonly used [26]. The biological time (when the events happen) and the computing time (how much computing time is required to perform the computing operations describing the same events) are not only different, but also not directly related. Working with "signals from the future" must be excluded; for this goal, at the end of this period, the newly calculated values of the state variables must be communicated to all (interested) fellow neurons. This action essentially introduces a "biology clock signal period" that is a million-fold longer than the electronic clock signal period. Correspondingly, the achievable performance is desperately low: orders of magnitude fewer neurons can be simulated than planned [26]. For a detailed discussion see [27, 19, 55].

Figure 9 depicts the experimental equivalent of Figure 5.

Fig. 9 Performance gains of supercomputers modeled as "roofline" [40], as measured with the benchmarks HPL and HPCG (data taken from the TOP500 database [41]), and the one concluded for brain simulation from [26]. The red pentagons denote the performance gain measured using half-precision operands.

In [26], the power consumption efficiency was also investigated. It is presumed that (to avoid unnecessary energy consumption) the measurement was performed at the point where involving more cores increases the power consumption but does not increase the payload simulation performance any more. This assumption resulted in the "reasoned guess" for the efficiency of brain simulation in Figure 5. (Despite its failure, the SpiNNaker2 is also under construction [21].) As using AI workloads on supercomputers is of growing importance (for a discussion from this point of view, see [57]), the performance gain of a processor-based
AI application can be estimated to be between those of HPCG and brain simulation, closer to that of HPCG. As discussed experimentally in [58] and theoretically in [57], in the case of neural networks (especially when an improper layering depth is selected), the efficiency can be much lower than that estimated value. Recall that, since AI nodes usually perform simple calculations compared to the functionality of supercomputer benchmarks, their communication/calculation ratio is much higher, making the efficacy even worse ("artificial intelligence, . . . it's the most disruptive workload from an I/O pattern perspective"). Our conclusions are underpinned by experimental research [58]:

– "strong scaling is stalling after only a few dozen nodes"
– "The scalability stalls when the compute times drop below the communication times, leaving compute units idle. Hence becoming an communication bound problem."
– "network layout has a large impact on the crucial communication/compute ratio: shallow networks with many neurons per layer . . . scale worse than deep networks with less neurons."

The massively "bursty" nature of the data (the different nodes of a layer want to use the communication subsystem at the same moment) also makes the case harder. The commonly used global bus is overloaded with messages (for a detailed discussion see [19]), which may lead to a "communicational collapse" (demonstrated in Figure 5.(a) in [59]): at an extremely large number of cores, exceeding the critical threshold of communication intensity leads to an unexpected and drastic change of the network latency.

Fig. 10 The history of the supercomputer Piz Daint in terms of efficiency and payload performance [50]. The left subfigure shows how the efficiency changed as the developers proceeded towards higher performance (dependence of E_HPL on (1−α_eff) and N, stages 2012/11 through 2018/11). The right subfigure shows the reported performance data (the bubbles), together with the diagram lines calculated from those values as described above (development of R_Max^HPL; configurations: Xeon E5-2670 (2012, 2013), Xeon E5-2670 + NVIDIA K20x (2013), Xeon E5-2690 + NVIDIA Tesla P100 (2016, 2017, 2018)). Compare the value of a diagram line to the measured performance data of the next reported stage.

As the values of the parameters of our model are inferred from non-dedicated, single-shot measurements, their reliability is limited. One can verify, however, how our model predicts values derived from later measurements. Supercomputers usually do not have a long lifespan and several documented stages; one of the rare exceptions is the supercomputer
Piz Daint. Its documented lifetime spans over six years, in which period different numbers of cores, without and with acceleration, and using different accelerators, were used. Figure 10 depicts the performance and efficiency values published during its lifetime, together with the diagram lines predicting (at the time of making the prediction) the values expected at higher nominal performance. The left subfigure shows how the changes made in the configuration affected the efficiency (the timeline starts in the top right corner, and a line connects the adjacent stages).
In the right subfigure, the bubbles represent the data published in adjacent editions of the TOP500 list, and the diagram lines crossing them are the predictions made from those snapshots; a predicted value shall be compared to the value published in the next list. It is especially remarkable that introducing GPGPU acceleration resulted in only a slight increase (in good agreement with [60] and [15]) compared to the value expected purely from the increase in the number of cores. Although more than one parameter was changed between our "samplings", so the net effect of each change cannot be demonstrated in isolation, the measured data sufficiently underpin our (limited-validity) conclusions and show that the theory correctly describes the tendency of the development of performance and efficiency, and even its predicted performance values are reasonably accurate.

Introducing a GPU accelerator is a one-time performance-increasing step [15] and cannot be taken into account by the theory. Notice that introducing the accelerator increased the payload performance but decreased the efficiency (copying data from one address space to another increases the latency). Changing the accelerator to another type with slightly higher performance (but higher latency, due to its larger GPGPU memory) caused a slight decrease in the absolute performance because of the considerably dropped efficiency.
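The prediction procedure behind the diagram lines in Fig. 10 can be sketched in a few lines: from one published (R_Peak, R_Max, N) snapshot the (1−α) value is derived, and from it the efficiency expected at a larger configuration is computed. The first-order Amdahl/Karp–Flatt-like forms assumed below for Eqs. (4)–(5), as well as all numeric inputs, are illustrative placeholders rather than the published Piz Daint data.

def one_minus_alpha_from_snapshot(r_max, r_peak, n_cores):
    # Efficiency of the snapshot and the (1-alpha) value derived from it
    # (assumed Karp-Flatt-like form of Eq. (5)).
    efficiency = r_max / r_peak
    return (1.0 / efficiency - 1.0) / (n_cores - 1.0)

def predict_r_max(r_peak_new, n_cores_new, one_minus_alpha):
    # First-order (Amdahl-like) efficiency at the new configuration
    # (assumed form of Eq. (4)); payload performance = efficiency * R_Peak.
    efficiency_new = 1.0 / (1.0 + (n_cores_new - 1.0) * one_minus_alpha)
    return r_peak_new * efficiency_new

# Usage: derive (1-alpha) from one TOP500 edition and compare the prediction
# with the R_Max published in the next edition (placeholder numbers).
oma = one_minus_alpha_from_snapshot(r_max=1.0e16, r_peak=1.3e16, n_cores=100_000)
print(predict_r_max(r_peak_new=2.6e16, n_cores_new=200_000, one_minus_alpha=oma))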
As detailed above, our theoretical model enables us to calculate the payload performance, in a first-order approximation, at any nominal performance value. In the light of all this, one can estimate the short-term and the longer-term development of supercomputer performance, see Figs. 11.A and 11.B. The diagram lines are calculated using Eq. (4), with the α parameter values derived from the TOP500 data of the Summit supercomputer; the bubbles show the measured values. The diagram lines, from the bottom up, show the double-precision HPCG, the double-precision HPL, and the half-precision [53] HPL (HPL-AI) diagrams. Given that our parameter values are calculated from a snapshot, that the calculation is essentially an extrapolation, and that at high nominal performance values using the second-order approximation becomes more and more pressing, the predictions shown in Figure 11 are rough and very optimistic approximations, although somewhat similar to the real upper limit values.

In addition to the measured and published performance data, two more diagram lines, representing two more calculated α values, are also depicted. The 'FP0' (orange) diagram line is calculated with the assumption that the supercomputer does all the work needed to perform the HPL benchmark, but the actual FP operations are not performed; in other words, the computer works with zero-bit length floating operands (FP0). (The role of α_HPL^FP0 is akin to the execution time of the "empty loop" in programming.) The 'Science' (red) diagram line is calculated with the assumption that nothing is calculated and only science (the finite propagation time due to the finite speed of light) limits the payload performance; here a 100 m cable length was assumed, which would imply an extremely high processor density and a dissipation of some GW.
Fig. 11 Tendency of the development of payload performance in the near and farther future of supercomputing. The parameters of Summit are used for illustration; for comparison, the double-precision performance diagram line of Fugaku is also displayed. The diagram lines are calculated from the theory; the marks are the measured values from the TOP500 database [50] and [53].

The nonlinearity of the payload performance around the Eflops nominal performance is visible, and it depends both on the amount of computing+communication and on the nominal performance (represented by the number of cores). Figure 11.B shows the farther future (in first-order approximation): towards Zflops [7]. No surprise that all payload performance diagram lines run into saturation, even the 'FP0' and 'Science' ones. For comparison, the double-precision payload performance of Fugaku is also displayed. Recall that the diagram lines are calculated in the first-order approximation; in the second-order approximation, it is expected that the diagram lines reach their inflection point and break down. These top supercomputers are near to that point; this is why their development has stalled.
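The saturation seen in Fig. 11 follows directly from the first-order form of the model. Assuming (as above) that Eq. (4) has the Amdahl-like form, with P_single denoting the single-processor performance and N the number of processors, the payload performance is bounded regardless of the nominal performance:

R_Max(N) = N · P_single · E(N),    E(N) = 1 / (1 + (N−1)·(1−α))

lim_{N→∞} R_Max(N) = P_single / (1−α)

That is, under this assumed first-order form, every workload (every α) defines its own saturation level, which no increase in the number of cores can surpass.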
The analogies in this section are not meant to imply a direct correspondence between certain physical and computing phenomena. Instead, we want to draw attention both to the fact that under extreme conditions qualitatively different behavior may be encountered, and to the fact that scrutinizing certain formerly unnoticed or neglected aspects enables us to explain the new phenomena. Unlike in nature, in computing the technical implementation of the critical points can be changed, and through this the behavior of computing systems can also be altered. No systematic discussion is given here; for a detailed discussion and more details, see [4].
The performance gain of parallelized sequential processing cannot exceed 1/(1−α), a well-known consequence of Amdahl's statement. Due to this, the computing performance of a system cannot be increased above the performance defined by the single-processor performance, the parallelization technology, and the number of processors. Laws of nature prohibit exceeding a specific computing performance (using the classical paradigm and its classical implementation). There is an analogy between adding speeds in physics and adding performances in computing.
In both cases, a correction term is introduced that provides a noticeable effect only at extremely large values. It seems to be an interesting parallel that both nature and extreme-scale technical (computing) systems show some extraordinary behavior: the linear behavior experienced under normal conditions gets strongly non-linear at large values of the independent variable. That behavior makes the validity of linear extrapolations, as well as of the linear addition of performances, at least questionable at high performance values in computing.

7.2 Quantal nature of computing time

In computing, the cooperation of components needs some kind of synchronization. Typically, a central clock (with clock domains and distribution trees) is used. In today's technology, the length of the clock cycle is in the ns range, so it appears quasi-continuous to human perception. In some applications, for example when attempting to imitate the parallel operation of the nervous system, the non-continuous nature of computing time comes to light. In that case, the independently running neurons must be put back onto their biological time scale: they stop at the end of the "integration grid time" periods and distribute their results to the peer neurons. This introduces a "biological clock time", which is about a million-fold longer than the "processor clock time", and it limits the achievable parallelization gain. The effect is noticeable only at a vast number of cores. The limit is conceptual, but it is made much worse by the "communication burst" that the simultaneous need for communication in neural networks represents. For details see [27].

7.3 'Quantum states' of supercomputers

As discussed in detail in [4], the behavior of computing systems under extreme conditions shows surprising parallels with the behavior of natural objects. Really, "More Is Different" [28]. The behavior of supercomputers is somewhat analogous to the behavior of quantum systems, where the measurement selects one of the possible states (and, at the same time, kills all other possible states). In computing, a supercomputer –as a general-purpose computing system– has the potential of high performance, defined by the impressive parameters of its components. However, when we run a computation (that is, we measure the computing performance of our computer), that workload selects the best possible combination of limitations that defines the performance, and it kills all other potential performances.

The logical dependence of the operation of the components implicitly also means their temporal dependence [19], and it introduces idle times into computing. In this way, the workload defines how much of those potential abilities can be used: the datasheet values represent the hard limits, and the workload sets the soft limits, given that the components block each other's operation. The workload defines a fill-out factor: it introduces different idle times into the operation of the components, and in this way it forces workload-defined soft performance limits onto the components of the supercomputer. Different workloads force different limitations (use the available resources differently), giving a natural explanation of their "different efficiencies" [14]. In other words, running some computation destroys the potentially achievable high performance, defined individually by the components.

Benchmarking such computing systems introduces one more limiting component: the needed computation.
For floating-point computations, the 'best possible' (producing the highest figures of merit) benchmark is HPL. With the development of parallelization and processor technology, the floating computation itself became the major contributor in defining the efficiency and performance of the system. Since the benchmark measurement method itself is a computation, the best measurable floating payload performance is limited by the value that the benchmark procedure itself represents.

For real-life programs (such as HPCG), the workload-defined performance level (saturation value) is already set well below the Eflops nominal performance, see Fig. 2. Further enhancements in technology, such as tensor processors and the OpenCAPI connection bus, can slightly increase their saturation level but cannot change the science-defined shape of the diagram line.
Supercomputers have reached their technical limitations; their development has run out of steam. Continuing to enhance the components of a supercomputer that is intended to run any calculation, without changing the underlying computing paradigm, is not worthwhile any more. To enter the "next level", really renewing the classic computing paradigm is needed [61–63, 19].

The ironic remark that 'Perhaps supercomputers should just be required to have written in small letters at the bottom on their shiny cabinets: "Object manipulations in this supercomputer run slower than they appear." [14]' is becoming increasingly relevant. The impressive numbers about the performance of the components (including single-processor and/or GPU performance and the speed of the interconnection) become less relevant when going to the extremes. Given that the most substantial α contribution today originates in the computation the supercomputer runs, even the best possible benchmark, HPL, dominates the floating performance measurement. Enhancing other contributions, such as the interconnection, results in a marginal enhancement of the payload performance; that is, the overwhelming majority of the expenses increases only the "dark performance". Because of this, the answers to the questions in the title are: there are as many performance values as there are measurement methods (and these can be further varied by how big a portion of the available cores is used in the measurement), and the benchmarks actually measure mainly how much mathematics/communication the benchmark program does, rather than the supercomputer architecture (provided that all components deliver their technically achievable best parameters).

References
1. S. H. Fuller and L. I. Millett, Eds., The Future of Computing Performance: Game Over or Next Level? National Academies Press, Washington, 2011.
2. G. M. Amdahl, "Validity of the Single Processor Approach to Achieving Large-Scale Computing Capabilities," in AFIPS Conference Proceedings, vol. 30, 1967, pp. 483–485.
3. J. P. Singh, J. L. Hennessy, and A. Gupta, "Scaling parallel programs for multiprocessors: Methodology and examples," Computer, vol. 26, no. 7, pp. 42–50, Jul. 1993.
4. J. Végh and A. Tisan, "The need for modern computing paradigm: Science applied to computing," in Computational Science and Computational Intelligence CSCI; The 25th Int'l Conf on Parallel and Distributed Processing Techniques and Applications. IEEE, 2019, pp. 1523–1532. [Online]. Available: http://arxiv.org/abs/1908.02651
5. J. Végh, "The performance wall of parallelized sequential computing: the roofline of supercomputer performance gain," Parallel Computing, vol. in review, http://arxiv.org/abs/1908.02280, 2019.
6. I. Markov, "Limits on fundamental limits to computation," Nature, vol. 512(7513), pp. 147–154, 2014.
7. Liao, Xiang-ke et al., "Moving from exascale to zettascale computing: challenges and techniques," Frontiers of Information Technology & Electronic Engineering.
Science
Nature, vol. 551, pp. 554–556, 2017.
14. IEEE Spectrum, "Two Different Top500 Supercomputing Benchmarks Show Two Different Top Supercomputers," https://spectrum.ieee.org/tech-talk/computing/hardware/two-different-top500-supercomputing-benchmarks-show-two-different-top-supercomputers, 2017.
15. H. Simon, "Why we need Exascale and why we won't get there by 2020," in Exascale Radioastronomy Meeting.
Commun. ACM, vol. 31, no. 5, pp. 532–533, May 1988.
17. S. Krishnaprasad, "Uses and Abuses of Amdahl's Law," J. Comput. Sci. Coll.
Frontiers in Neuroscience
Frontiers in Neuroinformatics, vol. 8, p. 78, 2014.
25. S. B. Furber, D. R. Lester, L. A. Plana, J. D. Garside, E. Painkras, S. Temple, and A. D. Brown, "Overview of the SpiNNaker System Architecture," IEEE Transactions on Computers, vol. 62, no. 12, pp. 2454–2467, 2013.
26. S. J. van Albada, A. G. Rowley, J. Senk, M. Hopkins, M. Schmidt, A. B. Stokes, D. R. Lester, M. Diesmann, and S. B. Furber, "Performance Comparison of the Digital Neuromorphic Hardware SpiNNaker and the Neural Network Simulation Software NEST for a Full-Scale Cortical Microcircuit Model," Frontiers in Neuroscience, vol. 12, p. 291, 2018.
27. J. Végh, "How Amdahl's Law limits performance of large artificial neural networks," Brain Informatics, vol. 6, pp. 1–11, 2019. [Online]. Available: https://braininformatics.springeropen.com/articles/10.1186/s40708-019-0097-2/metrics
28. P. W. Anderson, "More Is Different," Science, vol. 177, pp. 393–396, 1972.
29. D. Patterson and J. Hennessy, Eds., Computer Organization and Design. RISC-V Edition. Morgan Kaufmann, 2017.
30. R. Waser, Ed., Advanced Electronics Materials and Novel Devices, ser. Nanoelectronics and Information Technology. Wiley, 2012.
31. K. Hwang and N. Jotwani, Advanced Computer Architecture: Parallelism, Scalability, Programmability, 3rd ed. Mc Graw Hill, 2016.
32. V. Weaver, D. Terpstra, and S. Moore, "Non-determinism and overcount on modern hardware performance counter implementations," in Performance Analysis of Systems and Software (ISPASS), 2013 IEEE International Symposium on, April 2013, pp. 215–224.
33. P. Molnár and J. Végh, "Measuring Performance of Processor Instructions and Operating System Services in Soft Processor Based Systems," in , 2017, pp. 381–387.
34. F. Ellen, D. Hendler, and N. Shavit, "On the Inherent Sequentiality of Concurrent Objects," SIAM J. Comput., vol. 43, no. 3, pp. 519–536, 2012.
35. L. Yavits, A. Morad, and R. Ginosar, "The effect of communication and synchronization on Amdahl's law in multicore systems," Parallel Computing, vol. 40, no. 1, pp. 1–16, 2014.
36. J. Végh and P. Molnár, "How to measure perfectness of parallelization in hardware/software systems," in , 2017, pp. 394–399.
37. F. Zheng, H.-L. Li, H. Lv, F. Guo, X.-H. Xu, and X.-H. Xie, "Cooperative computing techniques for a deeply fused and heterogeneous many-core processor architecture," Journal of Computer Science and Technology, vol. 30, no. 1, pp. 145–162, Jan 2015.
38. M. Mohammadi and T. Bazhirov, "Comparative Benchmarking of Cloud Computing Vendors with High Performance Linpack," in Proceedings of the 2nd International Conference on High Performance Compilation, Computing and Communications, ser. HP3C. New York, NY, USA: ACM, 2018, pp. 1–5.
39. A. H. Karp and H. P. Flatt, "Measuring Parallel Processor Performance," Commun. ACM, vol. 33, no. 5, pp. 539–543, May 1990.
40. S. Williams, A. Waterman, and D. Patterson, "Roofline: An insightful visual performance model for multicore architectures," Commun. ACM
Proceedings of the 3rd ACM/SPEC International Conference on Performance Engineering
Proceedings of the 2007 Workshop on Experimental Computer Science, ser. ExpCS '07. New York, NY, USA: ACM, 2007, pp. 3–3.
46. F. M. David, J. C. Carlyle, and R. H. Campbell, "Context Switch Overheads for Linux on ARM Platforms," in Proceedings of the 2007 Workshop on Experimental Computer Science, ser. ExpCS '07. New York, NY, USA: ACM, 2007. [Online]. Available: http://doi.acm.org/10.1145/1281700.1281703
47. J. Végh, J. Vásárhelyi and D. Drótos, "The performance wall of large parallel computing systems," in Lecture Notes in Networks and Systems 68. Springer, 2019, pp. 224–237. [Online]. Available: https://link.springer.com/chapter/10.1007%2F978-3-030-12450-2_21
48. J. Végh, "How Amdahl's law restricts supercomputer applications and building ever bigger supercomputers," CoRR, vol. abs/1708.01462, 2018. [Online]. Available: http://arxiv.org/abs/1708.01462
49. T. Ippen, J. M. Eppler, H. E. Plesser, and M. Diesmann, "Constructing Neuronal Network Models in Massively Parallel Environments," Frontiers in Neuroinformatics
The International Journal of High Performance Computing Applications
Proceedings of the International Conference for High Performance Computing, Networking, Storage, and Analysis, ser. SC '18. IEEE Press, 2018, pp. 47:1–47:11.
54. Y. Ao, C. Yang, F. Liu, W. Yin, L. Jiang, and Q. Sun, "Performance Optimization of the HPCG Benchmark on the Sunway TaihuLight Supercomputer," ACM Trans. Archit. Code Optim., vol. 15, no. 1, pp. 11:1–11:20, Mar. 2018.
55. J. Végh, "Which scaling rule applies to Artificial Neural Networks," in Computational Intelligence (CSCE) The 22nd Int'l Conf on Artificial Intelligence (ICAI'20)
How deep machine learning can be, ser. A Closer Look at Convolutional Neural Networks. Nova, in press, 2020, pp. 141–169. [Online]. Available: https://arxiv.org/abs/2005.00872
58. J. Keuper and F.-J. Preundt, "Distributed Training of Deep Neural Networks: Theoretical and Practical Limits of Parallel Scalability," in
Journal of Physics D: Applied Physics, vol. 52, no. 1, p. 014003, Oct. 2018.
60. V. W. Lee, C. Kim, J. Chhugani, M. Deisher, D. Kim, A. D. Nguyen, N. Satish, M. Smelyanskiy, S. Chennupaty, P. Hammarlund, R. Singhal, and P. Dubey, "Debunking the 100X GPU vs. CPU Myth: An Evaluation of Throughput Computing on CPU and GPU," in Proceedings of the 37th Annual International Symposium on Computer Architecture, ser. ISCA '10. New York, NY, USA: ACM, 2010, pp. 451–460. [Online]. Available: http://doi.acm.org/10.1145/1815961.1816021
61. J. Végh, Renewing computing paradigms for more efficient parallelization of single-threads, ser. Advances in Parallel Computing. IOS Press, 2018, vol. 29, ch. 13, pp. 305–330. [Online]. Available: https://arxiv.org/abs/1803.04784
62. J. Végh, "Introducing the Explicitly Many-Processor Approach," Parallel Computing, vol. 75, pp. 28–40, 2018.
63. ——, "How to extend the Single-Processor Paradigm to the Explicitly Many-Processor Approach," in 2020 CSCE, Fundamentals of Computing Science.