Noname manuscript No. (will be inserted by the editor)
Finally, how many efficiencies do supercomputers have?
And what do they measure?
János Végh
Received: date / Accepted: date
Abstract
Using an extremely large number of processing elements in computing systems leads to unexpected phenomena, such as different efficiencies of the same system for different tasks, that cannot be explained within the frame of the classical computing paradigm. The simple, non-technical model introduced here (which nevertheless considers the temporal behavior of the components) enables us to set up the frame and formalism needed to explain those unexpected experiences around supercomputing. Introducing temporal behavior into computer science also explains why only extreme-scale computing enabled us to reveal the experienced limitations. The paper shows that the degradation of the efficiency of parallelized sequential systems is a natural consequence of the classical computing paradigm, rather than an engineering imperfection. The workload that supercomputers run is largely responsible for wasting energy, as well as for limiting the size and type of tasks. Case studies provide insight into how different contributions compete for dominating the resulting payload performance of a computing system, and how enhancing the interconnection technology made computing+communication dominant in defining the efficiency of supercomputers. Our model also enables us to derive predictions about supercomputer performance limitations for the near future, and it provides hints for enhancing supercomputer components. Phenomena experienced in large-scale computing show interesting parallels with phenomena experienced in science more than a century ago, through whose study a modern science was developed.
Project no. 136496 has been implemented with the support provided from the National Research, Development and Innovation Fund of Hungary, financed under the K funding scheme. An extended and updated form of the MS submitted to J. Supercomputing.
J. Végh, Kalimános BT, Hungary. E-mail: [email protected]
Keywords
Supercomputer performance · Parallelized sequential processing · Efficiency of supercomputers · Limitations of parallel processing · Inherent limitations of distributed processing · Behavior of extreme-scale systems · ANN performance · ANN efficiency · Temporal behavior

1 Introduction

Given that the dynamic growth of single-processor performance stalled about two decades ago [1], the only way that remained to achieve the required high computing performance is parallelizing the work of a vast number of sequentially working single processors. However, as was predicted very early [2], and experimentally confirmed decades later [3], the scaling of parallelized computing is not linear. Even, "there comes a point when using more processors . . . actually increases the execution time rather than reducing it" [3]. Parallelized sequential processing has different rules of the game [3,4]: its performance gain ("speedup") has its inherent bounds [5].

Akin to how the laws of science limit the performance of single-thread processors [6], the commonly used computing paradigm (through its technical implementation) limits the payload performance of supercomputers [4]. On one side, experts expected performance to reach the magic 1 Eflops around the year 2020; see Figure 1 in [7]. "The performance increase of the No. 1 systems slowed down around 2013, and it was the same for the sum performance" [7], but the authors extrapolated linearly, expecting that the development continues and that "zettascale computing" (i.e., a thousand times the exaflops target) shall be achieved in just more than a decade. On the other side, it has recently been admitted that linearity is "A trend that can't go on ad infinitum". Furthermore, it "can be seen in our current situation where the historical ten-year cadence between the attainment of megaflops, teraflops, and petaflops has not been the case for exaflops" [8]. Officially, the TOP500 [41] evaluation states (as of 2020) that "the top of the list remains largely unchanged" and "the full list recorded the smallest number of new entries since the project began in 1993".

The expectations placed on supercomputers are excessive. For example, as the name of the company PEZY witnesses, a billion-fold increase in payload performance is expected. It looks like, in the feasibility studies on supercomputing using parallelized sequential systems, an analysis of whether building computers of such a size is feasible (and reasonable) remained out of sight, either in the USA [9,10], in the EU [11], in Japan [12], or in China [7]. The "gold rush" is going on, even in the most prestigious journals [13,10].

(There are some doubts about the definition of exaFLOPS: whether it means the nominal performance R_Peak or the payload performance R_Max, measured by which benchmark (or how it depends on the workload the computer runs), and, finally, using what operand length. Here the term is used as R_{HPL-Max}. To produce higher figures, several other benchmark results, not related to floating-point computation, have been published. A special issue: https://link.springer.com/journal/11714/19/10. https://en.wikipedia.org/wiki/PEZY_Computing: the name PEZY is an acronym derived from the Greek-derived metric prefixes Peta, Exa, Zetta, Yotta.)
In addition to the previously existing "two different efficiencies of supercomputers" [14], further efficiency/performance values appeared (of course, with higher numeric figures), and several more efficiencies can easily be derived. Although severe counter-arguments have also been published, mainly based on the power consumption of both single processors and large computing centers [15], the moon-shot of limitless parallelized processing performance is still pursued. The probable source of the idea is "weak scaling" [16,3]. However, it is based simply on a misinterpretation [17,18] of the terms in Amdahl's law [55]. In reality, Amdahl's Law (in its original spirit) is valid for all parallelized sequential activities, including computing-unrelated ones, and it is the governing law of distributed (including super-) computing.

Demonstrative failures of some systems (such as the supercomputers Gyoukou and Aurora'18, and the brain simulator SpiNNaker) are already known, and many more are expected to follow, such as Aurora'21 [22], the mystic Chinese supercomputers and the planned EU supercomputers. Fugaku, although it considerably enhanced the efficacy of computing, mainly due to the clever placement and use mode of its on-chip memory, also stalled at about 40% of its planned capacity [23] and could increase its payload performance only marginally in half a year.

Similar is the case with exascale applications, such as brain simulation. Exaggerated news about simulating the brain of some animals, or a large percentage of the human brain, appeared. The reality is that the many-thread implementation of a brain simulator can fill an extremely large amount of memory with the data of billions of artificial neurons [24], and a purpose-built (mainly hardware (HW)) brain simulator can be designed to simulate one billion neurons [25]. In practice, however, they both can simulate only about 80 thousand neurons [26], mainly because of "the quantal nature of the computing time" [27]. "More Is Different" [28].

The confusion is growing. The paper attempts to clear up the terms by scrutinizing the basic notions, contributions, and measurement methods. In section 2, an intentionally strongly simplified, non-technical model, based on the temporal behavior of the physical implementation of computing [19], is presented. The notations for Amdahl's Law, which form the basis of the present paper, are introduced in section 3.

(The related work and speedup deserved the Gordon Bell Prize. As explicitly suspected in [18]: Gustafson's formulation gives the illusion that N can increase indefinitely. It was also learned that a specific processor design is needed for exascale: as part of the announcement, the development line Knights Hill [20] was canceled, to be replaced by a "new platform and new micro-architecture specifically designed for exascale". Despite its failure, SpiNNaker2 is also under construction [21]. https://ec.europa.eu/newsroom/dae/document.cfm?doc_id=60156)
(Fig. 1 panel labels: Access, Initiation, Software Pre, OS Pre, Process, Just waiting, OS Post, Software Post, Access, Termination; the brackets mark the Payload, Total and Extended time spans.)
Fig. 1
Left: The time diagram of parallelized sequential operation in time-space [19]. Right: A conceptual, simplified model [55] of parallelized sequential computing operations, based on their temporal behavior. Notice the different nature of those contributions, and that they have only one common feature: they all consume time.

It is shown that the degradation of the efficiency of parallelized sequential systems, as was suspected early [3], is a natural consequence of the computing paradigm, rather than an engineering imperfection (in the sense that it could be fixed later). Furthermore, its consequence is that parallelized sequential computing systems, by their very nature, have an upper payload performance bound. Different contributions form the sequential portion of the task (and, through this, degrade its parallelized performance), as detailed in section 4. The established model is validated in section 5. Given that the race to produce computing systems having components and systems with higher performance numbers is going on, section 6 predicts the expectable results of developments in the near future. The section introduces some further performance merits and, through interpreting them, concludes that increasing the size of supercomputers further, and making expensive enhancements to their technologies, only increases their non-payload performance. Section 7 discusses that, under extreme conditions, technical objects of computing show a series of behaviors (for more details see [4]) similar to those of natural objects.

The performance measurements are simple time measurements (although they need careful handling and proper interpretation; see good textbooks such as [29]): a standardized set of machine instructions is executed (a large number of times) and the known number of relevant operations is divided by the measurement time; this holds for both single-processor and distributed, parallelized
However, themodel is general enough to discuss some examples of parallelly working sys-tems qualitatively. We shall neglect different contributions as possible in thedifferent cases. Our model can easily be converted to a technical (quantitative)one via interpreting its contributions in technical terms, although with someobvious limitations. Such technical interpretations also enable us to find outsome technical limiting factors of the performance of parallelized computing.The non-parallelizable (i.e. apparently sequential) part of tasks comprisescontributions from HW, operating system (OS), software (SW) and Prop-agation Delay (PD), and also some access time is needed for reaching theparallelized system. This separation is rather conceptual than strict, althoughdedicated measurements can reveal their role, at least approximately. Somefeatures can be implemented in either SW or HW, or shared between them.Furthermore, some sequential activities may happen partly parallel with eachother. Relative weights of these different contributions are very different fordifferent parallelized systems, and even within those cases depend on many spe-cific factors. That means, in every single parallelization case, a careful analysisis required. SW activity represents what was assumed by Amdahl as the totalsequential fraction . Non-determinism of modern HW systems [32] [33] alsocontributes to the non-parallelizable portion of the task: the resulting execu-tion time of parallelly working processing elements is defined by the slowestunit. This aspect is neglected in the weak scaling approximation Although some OS activity was surely included, Amdahl concluded some 20 % SW frac-tion, so at that time the other contributions could be neglectedapart from SW contribution.As shown in Figure 1 and discussed below, for today, this contribution became by severalorders of magnitude smaller. However, at the same time the number of the cores grew severalorders of magnitude larger. J´anos V´egh
Notice that our model assumes no interaction between the processes running on the parallelized system, beyond the necessary minimum: starting and terminating the otherwise independent processes, which take their parameters at the beginning and return their results at the end. It can, however, be trivially extended to more general cases, when the processes must share some resource (such as a database, which shall provide different records for the different processes), either implicitly or explicitly. Concurrent objects have their inherent sequentiality [34], and synchronization and communication between those objects considerably increase [35] the non-parallelizable portion of the task (i.e., the contribution to (1 − α_eff^SW) or (1 − α_eff^OS)). Because of this effect, in the case of an extremely large number of processors, special attention must be devoted to their role in the efficiency of applications running on parallelized systems.

The physical size of the computing system also matters. A processor, connected to the first one with a cable of a length of dozens of meters, must spend several hundred clock cycles waiting, simply because of the finite speed of the propagation of light, topped by the latency time and the hops of the interconnection (not to mention geographically distributed computer systems, such as some clouds, connected through general-purpose networks). Detailed calculations are given in [36].

After reaching a certain number of processors, there is no more increase in the payload fraction when adding more processors: the first fellow processor has already finished its task and is idle waiting, while the last one is still idle waiting for its start command. This limiting number can be increased by organizing the processors into clusters: the first computer must then speak directly only to the heads of the clusters. Another way is to distribute the job near to the processing units: it can happen either inside the processor [37,23], or one can let the processing units of a Graphic Processing Unit (GPU) do the job. (Notice, however, that any additional actor on the scene increases the latency of the computation.)

This looping contribution is not considerable (and, in this way, not noticeable) at a low number of processing units (apart from the other contributions), but it can be a dominating factor at a high number of processing units. This "high number" was a few dozen at the time of writing the paper [3]; today it is a few million. The method of how the effect of the looping contribution is considered is the borderline between the first- and second-order approximations in modeling the system's payload performance. The housekeeping keeps growing with the number of processors, while the system's resulting performance does not increase any more. The first-order approximation considers the contribution of the housekeeping as constant. The second-order approximation also considers that, as the number of processing units grows, the housekeeping grows with it, gradually becomes the dominating factor of the performance limitation, and leads to a decrease in the payload performance. (The strength of this effect strongly depends on the workload and the architecture.)
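As a minimal illustration of the difference between the two approximations, the sketch below (Python) compares a first-order model, where the looping/housekeeping term is a constant part of (1 − α), with a second-order model, where that term grows with the number of processing units. All parameter values are assumptions chosen only to show the qualitative shape, not measured data.

```python
# Illustrative sketch (not the author's code): first- vs second-order
# approximation of payload performance, following the discussion above.
# All parameter values are assumptions chosen only to show the shape of the curves.

def payload_performance(n_cores, one_minus_alpha, p_single=1.0):
    """Payload performance of N cores: N*P_single / (N*(1-alpha) + alpha)."""
    alpha = 1.0 - one_minus_alpha
    return n_cores * p_single / (n_cores * one_minus_alpha + alpha)

def first_order(n_cores, seq_fixed=1e-7):
    # The looping/housekeeping contribution is taken as a constant part of (1 - alpha).
    return payload_performance(n_cores, seq_fixed)

def second_order(n_cores, seq_fixed=1e-7, seq_per_core=1e-13):
    # The housekeeping grows with the number of cores (e.g., addressing each core once),
    # so it gradually dominates and the payload performance eventually decreases.
    return payload_performance(n_cores, seq_fixed + seq_per_core * n_cores)

if __name__ == "__main__":
    for n in (10**3, 10**5, 10**6, 10**7, 10**8):
        print(f"N={n:>9}: first-order={first_order(n):12.3e}  "
              f"second-order={second_order(n):12.3e}")
```

With the assumed values, the first-order curve saturates, while the second-order one reaches a maximum and then falls, which is exactly the inflection behavior discussed in the text.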
As Figure 1 shows, in parallelized operating mode (in addition to computation, there is also communication of data between the processing units) both the software and the hardware contribute to the execution time, i.e., they both must be considered in Amdahl's Law. This is not new, again: see [2]. Figure 1 also shows where there is room to improve computing efficiency. When combining PD properly with sequential scheduling, the non-payload time can be considerably reduced during fine-tuning of the system (see the cases of the performance increases of Sierra and Summit, half a year after their appearance on the TOP500 list). Also, mismatching the total time and the extended measurement time (or not making a proper correction) may lead to completely wrong conclusions [38], as discussed in [36].

Usually, Amdahl's law is expressed as

S^{-1} = (1 − α) + α/N    (1)

where N is the number of parallelized code fragments (or Processing Units (PUs)), α is the ratio of the parallelizable portion to the total, and S is the measurable speedup. From this,

α = N/(N − 1) · (S − 1)/S    (2)

When calculating the speedup, one actually calculates

S = ((1 − α) + α) / ((1 − α) + α/N) = N / (N·(1 − α) + α)    (3)

hence the resulting efficiency of the system (see Figure 2) is

E(N, α) = S/N = 1 / (N·(1 − α) + α) = R_Max/R_Peak    (4)

This phenomenon itself has been known for decades [3], and α is theoretically established [39]. Recently, however, the theory somewhat faded, mainly due to the quick development of parallelization technology and the increase of single-processor performance.

During the past quarter of a century, the proportions of the contributions changed considerably: today, the number of processors is thousands of times higher than it was a quarter of a century ago. The growing physical size and the higher processing speed increased the role of the propagation overhead; furthermore, the large number of processing units strongly amplified the role of the looping overhead. As a side-result of the technological development, the phenomenon of performance limitation returned, in a technically different form, at a much higher number of processors.
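A short sketch (Python) of Eqs. (1)-(4) is given below; the α and N values are assumed examples, used only to show how the speedup S and the efficiency E = R_Max/R_Peak follow from the formulas.

```python
# Illustrative evaluation of Eqs. (1)-(4); alpha and N below are assumed example values.

def speedup(alpha, n):
    """Eq. (3): S = N / (N*(1-alpha) + alpha)."""
    return n / (n * (1.0 - alpha) + alpha)

def efficiency(alpha, n):
    """Eq. (4): E = S/N = 1 / (N*(1-alpha) + alpha) = R_Max/R_Peak."""
    return speedup(alpha, n) / n

if __name__ == "__main__":
    alpha = 1.0 - 1e-7          # assumed effective parallelization
    for n in (1_000, 100_000, 1_000_000, 10_000_000):
        s = speedup(alpha, n)
        print(f"N={n:>10}  S={s:14.1f}  E={efficiency(alpha, n):8.4f}")
```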
Fig. 2
The two-parameter efficiency surface (as a function of the parallelization efficiency and the number of processing elements), as concluded from Amdahl's Law (see Eq. (4)), in first-order approximation. Some sample efficiency values for selected supercomputers are shown, measured with the benchmarks HPL and HPCG, respectively. "This decay in performance is not a fault of the architecture, but is dictated by the limited parallelism" [3]. (Axes: number of processors, non-payload/payload ratio, efficiency. Systems shown, per TOP500 2020.11: Fugaku, Summit, Sierra, Taihulight, Selene, Jewels, K computer, and an approximate value for the brain.)
By using Eq. (4), E = S/N = R_Max/R_Peak can serve equally well for describing the parallelization efficiency of a setup:

α_{E,N} = (E·N − 1) / (E·(N − 1))    (5)

As we discuss below, except for an extremely high number of processors, it can be safely assumed that the value of α is independent of the number of processors in the system. Eq. (5) can be used to derive the value of α from the values of the parameters R_Max/R_Peak and the number of cores N.

According to Eq. (4), the payload efficiency can be described by a two-dimensional surface, as shown in Figure 2. On the surface, some measured efficiencies of the present top supercomputers are also depicted, just to illustrate some general rules. Both the HPL and the HPCG efficiency values are displayed. The measured values are projected back to the axes, to enable us to compare the corresponding values of the processor counts and the parallelization efficiencies.

The measured values can be separated into two groups. The recent trend is that only a small fraction of the cores is used in the HPCG benchmarking, while of course all cores are used in the HPL benchmarking. As discussed above, the efficiency of course depends on the workload, so the two benchmarked efficiency values can be projected to different numbers of cores and different non-payload to payload ratios on the axes. The other group comprises measurements where the same number of cores was used in both benchmarks. For this group, for visibility, only the HPL projections are displayed.

In this latter group, it can be seen immediately that the efficiency sharply decreases with the number of cores in the system. In the former group, only about 10% of the total cores is used, and the two efficiency values differ by about an order of magnitude. The general experience is that the ratio of the HPL to the HPCG efficiency is about 200-500 when using the same number of cores. This is why these entries reduced their number of cores in the second benchmark: the payload performance reached its "roofline" [40,55] level at that number of cores; using all cores would decrease the system's performance by an order of magnitude, only because of the higher number of cores.

There is an inflexion point in the performance: "there comes a point when using more PUs . . . actually increases the execution time rather than reducing it" [3]. As can be concluded from the figure, increasing the system's nominal performance by an order of magnitude, at the same time, decreases its efficiency (and so its payload performance) by more than an order of magnitude. For the HPCG benchmark, the "roofline" [40] of that non-payload-to-payload intensity is already reached: all computers have about the same efficiency, at the same number of PUs.

It is noticeable that in Fig. 2 the systems having the best efficiency values do not use accelerators: the efficiency of systems using accelerators is much lower also in the case of the HPL benchmark, and it is even more disadvantageous in the case of the HPCG benchmark. As can be seen, the "roofline" efficiency is reached with a lower number of cores. In other words, the accelerators make the systems reach a much worse non-payload to payload ratio.

A direct proof of our statement is the rough estimation that, based on our Eq. (4), if we divide the HPCG efficiency of the supercomputers in the first group by the ratio of their cores used in the benchmark, we arrive at about the same efficiency value for all supercomputers.
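A sketch of this "computational experiment" is given below (Python): Eq. (5) applied to a measured efficiency E = R_Max/R_Peak and core count N. The figures used are placeholders, not values taken from the TOP500 list.

```python
# Deriving Amdahl's parameter from a measured efficiency, Eq. (5).
# E and N below are placeholder examples, not published TOP500 entries.

def alpha_from_efficiency(e, n):
    """Eq. (5): alpha = (E*N - 1) / (E*(N - 1))."""
    return (e * n - 1.0) / (e * (n - 1.0))

def efficiency_from_alpha(alpha, n):
    """Eq. (4), for cross-checking the inversion."""
    return 1.0 / (n * (1.0 - alpha) + alpha)

if __name__ == "__main__":
    e_hpl, n_cores = 0.75, 5_000_000          # assumed HPL-like measurement
    a = alpha_from_efficiency(e_hpl, n_cores)
    print(f"alpha_eff = {a:.12f}, (1-alpha_eff) = {1.0 - a:.3e}")
    print(f"round-trip efficiency check: {efficiency_from_alpha(a, n_cores):.3f}")
```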
Comments such as "The HPCG performance at only 0.3% of peak performance shows the weakness of the Sunway TaihuLight architecture with slow memory and modest interconnect performance" [44] and "The HPCG performance at only 2.5% of peak performance shows the strength of the memory architecture performance" [23] show that supercomputing experts did not realize that the efficiency is a(n at least) two-parameter function, depending on both the number of PUs in the system and its workload. That dependency defines the achievable payload performance. It looks like the community experienced the effect of the two-dimensional efficiency, but does not want to comprehend its reason, despite the early and clear prediction: "this decay in performance is not a fault of the architecture, but is dictated by the limited parallelism" [3].
Fig. 3 Correlation of the performance of processors using accelerators, and of the effective parallelism, with ranking, in 2017. (Left panel: processor performance (Gflop/s) vs. ranking by HPL; right panel: (1 − α_eff^HPL) vs. ranking by HPL; accelerated, non-accelerated, and GPU-accelerated systems, with their regression lines.)

In excessive systems built of modern HW, it is also dictated by the laws of nature [4]. Furthermore, its dependence can be perfectly described by the properly interpreted Amdahl's Law, rather than being an "empirical efficiency".

As suggested by Eq. (6), the trivial way to increase the resulting payload performance of a supercomputer is to increase the single-processor performance of its processors. Given that the single-processor performance has reached its limitations, some kind of accelerator is frequently used for this goal.

Fig. 3 shows how utilizing accelerators influences the ranking of supercomputers. The left subfigure shows that the ranking of a supercomputer does not depend on which method of acceleration it uses. Essentially the same is confirmed by the right subfigure: the effective parallelization rises with the ranking position, and the slope is the same for any kind of acceleration. As the left subfigure depicts, GPU-accelerated processors really increase the payload performance of the system, by a factor of 2-2.5. However, this increased performance is about 40 times lower than the expectation based on the nominal performance of the GPU accelerator. The right subfigure, however, reveals that the effective parallelization (the non-payload to payload ratio) of GPU-accelerated systems is nearly an order of magnitude worse than that of the non-accelerated systems, i.e., the resulting efficiency can be (depending on the size of the system) worse than in the case of utilizing unaccelerated processors; this can be a definite disadvantage when GPUs are used in a system with an extremely large number of processors.

Noticeably, the two new items in Fig. 2 (Selene and Jewels, based entirely on GPUs) not only show the worst efficacy for the HPL benchmark, but their HPCG efficiencies push back the number of usable cores (for real-life tasks) well below the hundred-thousand limit. On one side, the achieved HPCG efficiency values show the same tendency as the HPL efficiency values: the more cores, the lower the efficiency. On the other side, the figure shows that for real-life tasks the reasonable size of supercomputers is below 1M cores for non-accelerated cores, and well below 0.1M cores for accelerated ones. The efficiency is just a few percent, and increasing the number of cores decreases the efficiency (it does NOT increase the payload performance).

The reason is that GPUs and PUs have separate address spaces, and data must be moved from one address space to the other; this needs time and increases the latency of the processor. Turning the memory into a (partly) active element, using different 'coherence' solutions such as OpenCAPI, can mitigate this effect. See also section 5.

In the HPL class, the communication intensity is the lowest possible: the computing units receive their task (and parameters) at the beginning of the computation, and they return their result at the very end. That is, the core orchestrating their work must deal with the fellow cores only in these periods, so the communication intensity is proportional to the number of cores in the system.
Notice the need to queue requests at the beginning and at the end of the task.

In the HPCG class, iteration takes place: the cores return the result of one iteration to the coordinator core, which performs sequential operations: it not only receives and re-sends parameters, but also needs to compute new parameters before sending them to the fellow cores. The program repeats the process several times. As a consequence, the non-parallelizable fraction of the benchmarking time grows proportionally to the number of iterations and the size of the problem. The effect of that extra communication decreases the achievable performance roofline [40]: as shown in Fig. 4, the HPCG roofline is about 200 times lower than the HPL one, as discussed in section 3. It is remarkable that the performance gain roofline for HPL and the one for HPCG differ only marginally. Their difference can be attributed to the extra non-payload contribution of the benchmark HPCG. Notice that the performance gain values for HPCG are measured using only a fraction of the available cores; the real performance gain should be an order of magnitude lower. With that correction, a sharp breakdown of the performance gain can be observed, as theoretically predicted [55].

Only three of the ten top supercomputers used all of their cores when running HPCG, while, of course, they all used all their cores when running the benchmark HPL.
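The contrast between the two workload classes can be sketched as below (Python; all counts and per-message times are assumed placeholders): in the HPL-like case the orchestrating core talks to each fellow core once at the start and once at the end, while in the HPCG-like case this sequential work repeats in every iteration, so the non-parallelizable fraction grows with the iteration count.

```python
# Toy comparison of the sequential (non-parallelizable) benchmarking-time fraction
# for an HPL-like and an HPCG-like workload. All numbers are assumptions for illustration.

def sequential_fraction(n_cores, iterations, t_per_message, t_total):
    """Orchestrating core talks to every fellow core at start and end of each iteration."""
    sequential_time = 2 * n_cores * t_per_message * iterations
    return sequential_time / t_total

if __name__ == "__main__":
    n_cores = 1_000_000
    t_msg = 5e-7            # assumed ~500 ns per message, as mentioned in the text
    t_total = 10_000.0      # assumed total benchmark time in seconds

    hpl_like = sequential_fraction(n_cores, iterations=1,
                                   t_per_message=t_msg, t_total=t_total)
    hpcg_like = sequential_fraction(n_cores, iterations=500,
                                    t_per_message=t_msg, t_total=t_total)

    print(f"(1-alpha) HPL-like : {hpl_like:.2e}")
    print(f"(1-alpha) HPCG-like: {hpcg_like:.2e}  "
          f"({hpcg_like / hpl_like:.0f}x larger)")
```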
Comments such as "The HPCG performance at only 0.3% of peak performance shows the weakness of the Sunway TaihuLight architecture with slow memory and modest interconnect performance" [44] and "The HPCG performance at only 2.5% of peak performance shows the strength of the memory architecture performance" [23] show that supercomputing experts did not realize that the efficiency is a(n at least) two-parameter function of the number of PUs in the system, and that the workload defines the achievable payload performance.

Notice two more rooflines. During the development of the interconnection technology, between the years 2004 and 2012, different implementation ideas were around, and they competed for years.
Fig. 4
The performance gain of top supercomputers, as a function of their year of production. The marks display the measured values derived using the HPL and HPCG benchmarks, for the TOP3 supercomputers. The small black dots mark the performance data of the supercomputers
JUQUEEN and K, as of June 2014, for the HPL and HPCG benchmarks, respectively. The big black dot denotes the performance value of the system used by [26]. The red pentagons denote the performance gain measured using half-precision operands. The saturation effect can be observed for both the HPL and HPCG benchmarks.

The beginning of the second roofline, around the year 2011, coincides with the dawn of GPUs, interfering with the effect of the interconnection technology. The top roofline dawned with the appearance of Taihulight: some assistant processors take over part of the duties of the individual cores, and in this way the non-payload portion of the workload can be mitigated.

It is important to notice the two red pentagons in the figure: they represent the performance gain achieved using half-precision operand length. The performance gain is lower than the double-precision equivalent, just because of the increased relative weight of housekeeping, as discussed in detail in section 4.4.

The projections of the efficiency values onto the axes show that the top few supercomputers exhibit very similar parallelization efficiency and core number values: both are required to receive one of the top slots; see also Figure ??. The supercomputers Taihulight and
Fugaku are exceptions on both axes. They have the highest number of cores and the best HPL parallelization efficiency. An interesting coincidence is that the processors of both supercomputers have "assistant cores" (i.e., some cores do not do payload computing; instead, they take over "housekeeping duties"). This solution decreases the internal latency of the processors doing payload computing and increases the efficiency of the system. They both use a "light-weight operating system" (and so do Fugaku and Sierra; four out of the first four), also to reduce processor latency. This efficiency, of course, requires executing several floating-point instructions per clock cycle. That mode of operation gets more and more challenging for the interconnection, delivering data to and from the data processing units. Notice also, in their cases, the role of "near" memories: as explained in [19], the data delivery time considerably increases the "idle time" of computing. This idle time is why Fugaku, with its cleverly placed L2 cache memories, can be more effective when measured with HPL. This trick, however, does not work in the case of HPCG, because its "sparse" computations use those cache memories ineffectively. The "true" HPCG efficiency of Fugaku is expected to be between the corresponding values of Summit and Taihulight.

In addition, the processors of Taihulight comprise cooperating cores [37]. The direct core-to-core transfer uses a (slightly) different computing paradigm: the processor cores explicitly assume the presence of another core, and in this way their effective parallelism becomes much better; see also Fig. 6. In that figure, this data point and the ones using shorter operands (Summit and Fugaku) result in effective parallelization values below the limiting line. Reducing the loop count by internal clustering (in addition to the "hidden clustering" enabled by its assistant cores) and exchanging data without using the global memory, however, works only for the HPL case, where the contribution of the SW is low. The poor value of (1 − α_eff^HPCG) is not necessarily a sign of architectural weakness [9]: Taihulight comprises about four times more cores than Summit and performs the HPCG benchmark with ten-fold more cores. Given that HPCG mimics "real-life" applications, one can conclude that for practical purposes only systems comprising a few hundred thousand cores shall be built. More cores contribute only to the "dark performance".

According to Eq. (4), the efficiency can be interpreted in terms of α and N, and the payload performance of a parallelized sequential computing system can be calculated as

P(N, α) = N·P_single / (N·(1 − α) + α)    (6)

This simple formula explains why the payload performance is not a linear function of the nominal performance, and why, in the case of very good parallelization ((1 − α) ≪ 1) and low N, this nonlinearity cannot be noticed.

The value of α, however, can hardly be calculated for the present complex HW/SW systems from their technical data (for a detailed discussion see [55]). Two ways can be followed to estimate the value of α. One way is to calculate α for existing supercomputing systems (making "computational experiments" [18]), applying Eq. (5) to data in the TOP500 list [4]. This way provides a lower bound for (1 − α), which is already achieved. The other way around is to consider contributions of different origin, see section 4, and to calculate the upper limit of the value of (1 − α) that the given contribution alone does not enable to exceed (provided that that contribution is the dominant one). It gives us good confidence in the reliability of the parameters that the values derived in these two ways differ only within a factor of two. At the same time, this also means that the technology is already very close to its theoretical limitations.

(Assuming good interconnection. For computing systems connected with general-purpose networks, such as some high-performance clouds, this limiting number is much lower.)

Notice that the "algorithmic effects" – such as dealing with "sparse" data structures (which affects cache behavior, something that will have a growing importance in the age of "real-time everything" and neural networks) or communication between parallelly running threads, such as returning results repeatedly to the main thread in an iteration (which greatly increases the non-parallelizable fraction in the main thread) – manifest through the HW/SW architecture, and they can hardly be separated. Also notice that there are one-time and fixed-size contributions, such as utilizing time-measurement facilities or calling system services. Since α_eff is a relative merit, the absolute measurement time shall be large. When utilizing efficiency data from measurements that were dedicated to some other goal, proper caution must be exercised with the interpretation and accuracy of those data.

The 'right efficiency metric' [42] has always been a question (for a summary see the references cited in [43]) when defining efficient supercomputing. The goal of the present discussion is to find out the inherent limitations of parallelized sequential computing, and to provide numerical values for them. For this goal, the 'classic interpretation' [2,3,39] of performance was used, in its original spirit. The contributions mentioned in those papers were scrutinized, and their importance under current technical conditions was revised.

The left subfigure of Figure ?? shows that, to get a better ranking on the TOP500 list, a higher number of processors is required. The regression line is different for the TOP10 and the TOP50 positions. The cut line between "racing" and "commodity" supercomputers is around slot 10. As the right subfigure underpins, a high number of processors must be accompanied by a good parallelization efficiency; otherwise, the large number of cores cannot counterbalance the decreasing efficiency, see Eq. (6).

Theory can display data from systems with any contributors and any parameters, but from measured data only the sum of all contributions can be concluded, although dedicated measurements can reveal the values of separated contributions experimentally. The publicly available data enable us to draw only conclusions of limited validity.

The estimations below assume that the actual contribution is the dominating one and, as such, defines the achievable performance alone.
This situation is usually not the case in practice, but this approach enables us to find out the limiting (1 − α) values for all contributions.

In the systems implemented in the Single Processor Approach (SPA) [2] as parallelized sequential systems, the life of a task begins in one such sequential subsystem; see also Fig. 1. In large parallelized applications, running on general-purpose supercomputers, initially and finally only one thread exists, i.e., the minimal necessary non-parallelizable activity is to fork the other threads and join them again.

With the present technology, no such action can be shorter than one processor clock period. That is, the theoretical absolute minimum value of the non-parallelizable portion of the task is given as the ratio of the time of these two clock periods to the total execution time. The latter time is a free parameter in describing the efficiency. That is, the value of the effective parallelization α_eff depends on the total benchmarking time (and so does the achievable parallelization gain, too).

This dependence is, of course, well known to supercomputer scientists: for measuring the efficiency with better accuracy (and also for producing better α_eff values), hours of execution time are used in practice. In the case of benchmarking the supercomputer Taihulight [44], an HPL benchmark runtime of 13,298 seconds was used; on the 1.45 GHz processors, this means about 2·10^13 clock periods. The inherent limit of (1 − α_eff) at such a benchmarking time is 10^-13 (or, equivalently, the achievable performance gain is 10^13). In the paper, for simplicity, 1.00 GHz processors (i.e., 1 ns clock cycle time) will be assumed.

Supercomputers are also distributed systems. In a stadium-sized supercomputer, a distance between processors (cable length) of up to about 100 m can be assumed. The net signal round-trip time is ca. 10^-6 seconds, or about 10^3 clock periods; i.e., in the case of a finite-sized supercomputer, the performance gain cannot exceed the corresponding limit, only because of the physical size of the supercomputer. The presently available network interfaces have 100...200 ns latency times, and sending a message between processors takes a time of the same order of magnitude, typically 500 ns. This timing also means that making a better interconnection is no longer the bottleneck in enhancing performance. This statement is also underpinned by the discussion in section 4.3.

These predictions enable us to assume that the presently achieved value of (1 − α_eff) could also persist for roughly a hundred times more cores. However, another major issue arises from the computing principle of the SPA: only one processor at a time can be addressed by the first one. As a consequence, at least as many clock cycles must be used for organizing the parallelized work as many addressing steps are required. This number equals the number of cores in the supercomputer, i.e., the addressing operation in supercomputers in the TOP10 positions typically needs clock cycles in the order of 10^6...10^7, degrading the value of (1 − α_eff) correspondingly.

(This statement is valid even if some parallelly working units can execute more than one instruction in a clock period. One can take these two clock periods as an ideal (but not realistic) case. However, the actual limitation will inevitably be (much) worse than the one calculated for this idealistic case. The exact number of clock periods depends on many factors, as discussed below.)
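A back-of-the-envelope version of these estimates is sketched below (Python). The benchmark duration and clock frequency follow the Taihulight HPL run quoted above; the cable length and signal speed are assumptions used only for illustration.

```python
# Rough limits of (1 - alpha_eff) following the reasoning in the text.
# Benchmark time and clock follow the quoted Taihulight HPL run;
# cable length and signal speed are assumptions for illustration.

def one_minus_alpha_limit(sequential_clocks, total_time_s, clock_hz):
    """Sequential clock periods divided by the total number of clock periods."""
    total_clocks = total_time_s * clock_hz
    return sequential_clocks / total_clocks

if __name__ == "__main__":
    benchmark_time = 13_298.0        # s, HPL runtime quoted in the text
    clock = 1.45e9                   # Hz

    # Idealistic case: two clock periods (fork + join) of sequential work.
    ideal = one_minus_alpha_limit(2, benchmark_time, clock)
    print(f"idealistic (1-alpha) limit: {ideal:.1e}  -> max gain ~ {1/ideal:.1e}")

    # Physical size: assumed 100 m cable, signal speed ~2e8 m/s, round trip.
    round_trip_s = 2 * 100.0 / 2e8
    size_clocks = round_trip_s * clock
    size_limit = one_minus_alpha_limit(size_clocks, benchmark_time, clock)
    print(f"round trip: {round_trip_s:.1e} s (~{size_clocks:.0f} clocks), "
          f"(1-alpha) contribution: {size_limit:.1e}")
```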
Two tricks may be used to mitigate the number of the addressing steps: either the cores are organized into clusters, as many supercomputer builders do, or, at the other end, the processor itself can take over the responsibility of addressing its cores [37]. Depending on the actual construction, clustering of these kinds can reduce the number of addressing steps by orders of magnitude, and the resulting value of (1 − α_eff) improves correspondingly.

An operating system must also be used, for protection and convenience. If fork/join is executed by the OS, as usual, then because of the needed context switchings about 2·10^4 [45,46] clock cycles are used, rather than the 2 clock cycles considered in the idealistic case. The derived values are correspondingly four orders of magnitude worse; that is, the absolute limit is ≈ 10^-9, on a zero-sized supercomputer. This value is somewhat better than the limiting value derived above, but it is close to that value and surely represents a considerable contribution. This limitation is why a growing number of supercomputers uses a "light-weight kernel", or runs its actual computations in kernel mode; a method of computing that can be used only with well-known benchmarks.

However, this optimistic limit assumes that an instruction can be accessed in one clock cycle. This is usually not the case, but it seems to be a good approximation. On one side, even a cached instruction in memory needs about five times more access time, and the time required to access 'far' memory is roughly 100 times longer. Correspondingly, the most optimistic achievable performance gain values shall be scaled down by a factor of 5...100. The difference between α_eff^HPL and α_eff^HPCG can be attributed to a different cache behavior, because of the 'sparse' matrix operations.

4.2 The effect of workflow

The overly complex Figure 5 attempts to explain the phenomenon of why and how the performance of a supercomputer configuration depends on the application it runs. The non-parallelizable fraction (denoted in the figure by α_eff^X) of the computing task comprises components X of different origin. As already discussed, and as was noticed decades ago, "the inherent communication-to-computation ratio in a parallel application is one of the important determinants of its performance on any architecture" [3], suggesting that communication can be a dominant contribution to a system's performance. Figure 5.A displays a case with minimum communication, and Figure 5.B a case with moderately increased communication (corresponding to real-life supercomputer tasks). As the nominal performance increases linearly and the payload performance decreases inversely with the number of cores, at some critical value, where an inflection point occurs, the resulting payload performance starts to drop.

(Fig. 5 panels: A – Input Layer / "HPL Layer" / Output Layer; B – Input Layer / "HPCG Layer 1" ... "HPCG Layer n" / Output Layer; C – Input Layer / Hidden Layer 1 ... Hidden Layer n / Output Layer; in each case the right-hand diagram shows R_Max (Eflop/s) and the (1 − α_eff^X) contributions (α_SW, α_OS, α_eff) vs. R_Peak (Eflop/s).)

Fig. 5
The figure explains how the different communication/computation intensities of applications lead to different payload performance values in the same supercomputer system. Left column: models of the computing intensities for different benchmarks. Right column: the corresponding payload performances and α contributions as a function of the nominal performance of a fictive supercomputer (P = 1 Gflop/s @ 1 GHz). The blue diagram line refers to the right-hand scale (R_Max values), all others ((1 − α_eff^X) contributions) to the left-hand scale. The figure purely illustrates the concepts; the displayed numbers are only roughly similar to real ones. The performance breakdowns shown in the figures were experimentally measured by [3], [49] (Figure 7) and [26] (Figure 8).

The resulting non-parallelizable fraction sharply decreases the efficacy (or, in other words, the performance gain or speedup) of the system [47,48]. The effect was noticed early [3], under different technical conditions, but somewhat faded due to the successes of the development of parallelization technology.

Figure 5.A illustrates the behavior measured with the HPL benchmark. The looping contribution becomes remarkable around 0.1 Eflops, and breaks down the payload performance (see also Figure 1 in [3]) when approaching 1 Eflops. In Figure 5.B, the behavior measured with the benchmark HPCG is displayed. In this case, the contribution of the application (brown line) is much higher than in the previous case. The looping contribution (thin green line) is the same as above. Consequently, the achievable payload performance is lower, and the breakdown of the payload performance is also softer in the case of real-life tasks.

Given that no dedicated measurements exist, it is hard to make a direct comparison between theoretical prediction and measured data. However, the impressive and quick development of interconnecting technologies provides a helping hand.

4.3 Contribution of the interconnection

As discussed above, in a somewhat simplified view, the resulting performance can be calculated using the contributions to α as

P(N, α) ≈ N·P_single / (N·((1 − α_Net) + (1 − α_Compute) + (1 − α_Others)) + α)    (7)

The difference of the measured α values can be directly compared to the difference of the corresponding sum of the calculated α values, although here only qualitative agreement can be expected.

Both the quality of the interconnection and the nominal performance are parametric functions of their time of construction, so one can assume on the theory side that (in a limited period) the interconnection contribution changed as a function of the nominal performance as shown in Figure 6A. The other major contribution is assumed to be the computation itself. The benchmark computation contributions for HPL and HPCG are very different, so the sums of the respective component plus the interconnection component are also very different. Given that at the beginning of the considered period the contribution from the HPCG calculation and that of the interconnection were of the same order of magnitude, their sum changed only marginally (see the upper diagram lines), i.e., the measured performance improved only slightly. Because the benchmark HPCG is communication-bound (and so are real-life programs), their efficiency would be an order of magnitude worse. The reason is Eq. (4): when supercomputers use all of their cores, their achievable performance is not higher (or maybe even lower), only their power consumption is higher (and the calculated efficiency is lower). As predicted: "scaling thus put larger machines at an inherent disadvantage" [3]. The cloud-like supercomputers have a disadvantage in the HPCG competition: the Ethernet-like operation results in relatively high (1 − α) values.
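The switch of the dominating term can be illustrated with the toy calculation below (Python); the decay rate of the interconnection contribution and the constant computation contribution are assumed values, chosen only to reproduce the qualitative behavior discussed here: once the interconnection term drops below the computation term, the total no longer improves noticeably.

```python
# Toy model of the sum of (1-alpha) contributions while the interconnection improves.
# All parameter values are assumptions; only the qualitative behavior matters.

def total_sequential_fraction(year, net0=1e-5, improvement_per_year=2.0, compute=2e-7):
    """Interconnection contribution shrinks by a constant factor each year;
    the benchmark's own computation contribution stays constant."""
    net = net0 / (improvement_per_year ** year)
    return net + compute, net

if __name__ == "__main__":
    compute = 2e-7
    for year in range(0, 11):
        total, net = total_sequential_fraction(year, compute=compute)
        dominant = "interconnect" if net > compute else "computation"
        print(f"year {year:2d}: net={net:.1e}  total={total:.1e}  dominant: {dominant}")
```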
(This time also includes accessing the data.)

(Fig. 6 panels: A – (1 − α), "α_total and α_interconnection, theory", showing the α_Interconnect, α_HPL, α_HPCG, α_Interconnect+HPL and α_Interconnect+HPCG curves vs. R_Peak (Eflop/s); B – the corresponding measured data: HPL for June 2009-2019, HPL-AI for June 2020, HPCG for June 2017 and 2020.)

Fig. 6
The effect of changing the dominating contribution. The left subfigure shows the theoretical estimation, the right subfigure the corresponding measured data, as derived from the public database TOP500 [50] (only values for the first four supercomputers are shown). When the contribution from the interconnection drops below that of the computation, the value of (1 − α) (and the performance gain) gets saturated. The red 'x' marks denote half-precision values.

The case of the HPL calculation is drastically different (see the lower diagram lines). Since in this case, at the beginning of the considered period, the contribution from the interconnection was very much larger than that from the computation, the sum of these two contributions changes sensitively as the speed of the interconnection improves. As soon as the contribution from the interconnection decreases to a value comparable with that from the computation, the decrease of the sum slows down considerably, and further improvement of the interconnection causes only a marginal decrease in the value of the resulting α (and so only a marginal increase in the payload performance).

The measured data enable us to draw the same conclusion, but one must consider that here multiple parameters may have been changed. Their tendency, however, is surprisingly clear. Figure 6.B is actually a 2.5D diagram: the size of the marks is proportional to the time passed since the beginning of the considered period. A decade ago, the speed of the interconnection gave the major contribution to α_total. Enhancing it drastically in the past few years increased the efficacy. At the same time, because of the stalled single-processor performance, the other technology components changed only marginally. The computational contribution to α from the benchmark HPL remained constant as a function of time, so the quick improvement of the interconnection technology resulted in a quick decrease of α_total, and the relative weights of α_Net and α_Compute reversed.
The decrease in the value of (1 − α) can be considered as the result of the decreased contribution from the interconnection. However, the total α contribution decreased considerably only until α_Net reached the order of magnitude of α_Compute. This match occurred in the first 4-5 years of the period shown in Figure 6B: the sloping line is due to the enhancement of the interconnection. Then the two contributors changed their roles, and the constant contribution due to the computation started to dominate, i.e., the total α contribution decreased only marginally. As soon as the computing contribution took over the dominating role, the value of α_total did not fall any more: all measured data remained above that value of α. Correspondingly, the payload performance improved only marginally (and due to factors other than the interconnection).

At this point, as a consequence of the change of the dominating contributor, it was noticed that the efficacy of the benchmark HPL and the efficacy of real-life applications started to differ by up to two orders of magnitude. Because of this, a new benchmark program, HPCG [51], was introduced, since "HPCG is designed to exercise computational and data access patterns that more closely match a different and broad set of important applications" [52].

Since the major contributor is the computing itself, the different benchmarks contribute differently, and since that time "supercomputers have two different efficiencies" [14]. Yes: if the dominating α contribution (from the benchmark calculation) is different, then the same computer shows different efficiencies as a function of the computation it runs. Since that time, the interconnection provides a smaller contribution than the computation of the benchmark. Due to that change, enhancing the interconnection contributes mainly to the dark performance, rather than to the payload performance.

4.4 The effect of reduced operand length

The so-called HPL-AI benchmark used Mixed Precision rather than Double Precision computations. The name suggests that AI applications may run on the supercomputer with that efficiency. However, the type of workload does not change, so one can expect the same overall behavior for AI applications, including Artificial Neural Networks (ANNs), as for double-precision operands. For AI applications, the limitations remain the same as described above, except that when using Mixed Precision, the efficiency can be better by a factor of
2-3, strongly depending on the workload of the system; see also Fig. 8.

Unfortunately, when using half-precision, the enhancement comes from accessing less data in memory and using faster operations on the shorter operands, instead of reducing the communication intensity that defines the efficiency. Similarly, exchanging data directly between processing units [37] (without using the global and even the local memory) also enhances α (and the payload performance) [54], but it represents a (slightly) different computing paradigm. Only the two mentioned measured data points fall below the limiting line of (1 − α) in Figure 6.

Recent supercomputers Fugaku [23] and
Summit [53] provided their HPLperformance for both 64-bit and 16-bit operands. Of course, their performance Both names are rather inconsequent. On one side, the test itself has not much to dowith AI, it just uses the operand length common in Artificial Intelligence (AI) tasks; (HPL,similarly to AI, is a workload type ). On the other side, Mixed Precision is actually HalfPrecision: for multiplication twice as long operands are used temporarily. It is a differentquestion is if those operations are contracted. On the contrary, the relative weight of communication is increased in this way.he many efficiencies of supercomputers 21 seems to be much better with shorter operand length (at the same numberof operations, the total measurement time is much shorter). It was expectedthat their performance should be four times higher when using four timesshorter operands. Power consumption data [53] underpin the expectations:power consumption is about four times lower for four times shorter operands.Computing performance, however, shows a slighter performance enhancement:3.01 for
Summit and 3.42 for
F ugaku , because of the needed housekeeping.In the long run, a
In the long run, a Time_X value comprises housekeeping and computation. We assume that the housekeeping (indexing, branching) takes the same fixed amount of time for different operand lengths, and that the other time contribution (data delivery and bit manipulation) is proportional to the operand length. Given that, according to our model, the measured payload performance directly reflects the sum of all contributions, we can assume that

Time_16 = F_house + F_16
Time_64 = F_house + 4·F_16        (8)

where F_house is the time contribution from housekeeping (in a long-term run, using benchmark HPL), and F_16 is the time contribution due to manipulating 16 bits.

We can use two simple models for calculating the relative values of F_house and F_16. Both models are speculative rather than technically established, but they nicely point to the key issue: the role of housekeeping.

If no part of the housekeeping is parallelized with the floating computation and, furthermore, the computing and data transfer operations do not block each other, the simple summing above can be used, see Fig. 7. With a proper parallelization method, the non-floating housekeeping for the next operand could be performed in parallel with the floating operation on the current operand, so the theoretically possible ratio of four could be reached. Notice, however, that this model is not aware of the data transfer time.

When the temporal behavior of the components is considered, the trigonometric (orthogonal) sum of the non-payload and payload times shall be used, see [19]:

Time_16 = √(F_house² + F_16²)
Time_64 = √(F_house² + (4·F_16)²)        (9)

Table 1 shows the data calculated from the values published in the TOP500 list for the supercomputers Fugaku and Summit, together with their parameters calculated as described above, using the time-aware summing model. In the discussed simple workload case, there is no significant difference between the values derived using the two models. The values in the figures are given in units of payload performance ratios; the values in the table are given in units of the respective α contributions; the numbers can only be compared as discussed below. The Eff_64 and Eff_16 values are calculated from the corresponding published R_Max/R_Peak values; Amdahl's parameter is calculated using Eq. (5), for the two different operand lengths.
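The derivation can also be followed numerically. The sketch below is illustrative only: it assumes that Eq. (5) has the Karp–Flatt-like form (1−α) = (1/E − 1)/(N−1), that the total measurement time is normalized to unity, and that the core count used here is a placeholder rather than a published value; the efficiencies and the performance ratio are the ones quoted above for Fugaku.

# Illustrative sketch (not the author's code): deriving (1-alpha) and the
# Time values of Section 4.4 from published efficiency data.

def one_minus_alpha(efficiency, n_cores):
    # Assumed Karp-Flatt-like form of Eq. (5): (1-alpha) = (1/E - 1)/(N - 1)
    return (1.0 / efficiency - 1.0) / (n_cores - 1.0)

def time_values(eff_64, eff_16, n_cores, perf_ratio_16_vs_64):
    # Time_64 is taken directly as (1-alpha_64); Time_16 is the measured
    # (1-alpha_16) corrected for the shorter FP16 measurement time.
    time_64 = one_minus_alpha(eff_64, n_cores)
    time_16 = one_minus_alpha(eff_16, n_cores) / perf_ratio_16_vs_64
    return time_64, time_16

# Placeholder core count; substitute the published value of the machine analyzed.
print(time_values(eff_64=0.808, eff_16=0.691,
                  n_cores=7_000_000, perf_ratio_16_vs_64=3.42))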
Fig. 7 Left: the serial summing of the payload and non-payload contributions (assumes no parallelization and/or no blocking). Right: the parallel summing of the payload and non-payload contributions (assumes parallelization and enables considering mutual blocking).
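The two speculative summing models can also be inverted: given the Time_64 and Time_16 values, F_house and F_16 follow from Eq. (8) or Eq. (9). The sketch below does this under the stated assumptions, purely as an illustration of the role of housekeeping.

import math

def split_serial(time_64, time_16):
    # Serial model, Eq. (8): Time_16 = F_house + F_16, Time_64 = F_house + 4*F_16
    f_16 = (time_64 - time_16) / 3.0
    return time_16 - f_16, f_16

def split_time_aware(time_64, time_16):
    # Time-aware model, Eq. (9): Time_16^2 = F_house^2 + F_16^2,
    #                            Time_64^2 = F_house^2 + (4*F_16)^2
    f_16 = math.sqrt((time_64**2 - time_16**2) / 15.0)
    return math.sqrt(time_16**2 - f_16**2), f_16

With the Time values of Table 1, the two models yield F_house and F_16 estimates of the same order, in line with the statement that for this simple workload the two models do not differ significantly.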
Table 1 Floating point characteristics of the supercomputers Fugaku and Summit

Name     Eff_64   Eff_16   (1−α_64)   (1−α_16)   Time_64    Time_16
Fugaku   0.808    0.691    3.25e-8    6.12e-8    3.25e-8    1.79e-8
Summit   0.74     0.557    14.7e-8    33.2e-8    14.7e-8    11.0e-8

As discussed, (1−α_64) is the sequential portion of the total measurement time (aka the non-payload to payload ratio). Assuming that the total measurement time is the time unit, the limiting time of performing a floating operation with 64-bit operands, Time_64 (on our arbitrary time scale), is directly derived from the (1−α_64) value. To get a Time_16 value on the same scale, the measured (1−α_16) value must be corrected for the differing measurement time (by the measured performance ratios of 3.42 and 3.01, respectively).

The absolute values of the data in the last two columns of the table shall not be compared directly with those of the other supercomputer (their measurement times were different). Given that the task and the computing model were the same, however, we can directly compare the ratios of the F_house and F_16 values derived from them. The proportions of both the F_house values and the F_16/F_house values show that the housekeeping is much better for Fugaku than for
Summit. Given that their architectures are globally similar, the plausible reason for the difference in their efficacy (and performance) is that in the case of Summit the processor core plays the role of a proxy (and in this way represents a bottleneck), while Fugaku uses "assistant cores". Housekeeping increases latency and significantly decreases the performance of the system. The plausible reason for their better F_16 values is the clever use (and positioning, see [19]) of the L2-level cache memories.

In the case of Summit, we also know its HPCG efficiency. In its case, the Time_HPCG value is about 2·10⁻⁵, i.e., several hundred times higher than Time_HPL. Given that the floating-operand contributions are the same in the case of the two benchmarks, the difference is caused by F_house. In the long run, the different workload (iteration, more intensive communication, "sparse" computation forcing different cache utilization) forces a different F_house value, and that leads to the "different efficiencies" [14] of supercomputers under different workloads.

In the case of Fugaku, only a fraction of its cores was used in the benchmark HPCG, so only the achieved payload performance (but not its efficiency) can be validated. It is very plausible that its HPCG performance reached its roofline and that (because of the higher number of cores) its real HPCG efficiency would be around that of Taihulight. Anyhow, it would not be fair either to assume that speculation or to accept a value measured with a different number of cores; there are no measured data.

These data directly underpin that the technology is (almost) perfect: the contribution from the benchmark calculation
HPCG-FP64 is by orders of magnitude larger than the contribution from all the rest. Recalling that that benchmark program imitates the behavior (as defined by the resulting α) of real-life programs, one can see that the contribution from the other computing-related actors is about a thousand times smaller than the contribution of the computation+communication.

The unique role of the "mixed precision" efficiencies (a third kind of efficiency of a supercomputer), see the red 'x' marks in Fig. 6, deserves special attention. Strictly speaking, the points cannot be correctly positioned in the figure; they belong to a different scale (they are measured on different HW). On one side, the same number of operations is performed, using the same number of PUs. On the other side, four times less data are transferred and manipulated. The nominal performance is expected to be four times higher than in the case of using double-precision operands. Without correcting for the more than three times shorter measurement (see below), the efficiency mark is slightly above the corresponding value measured with double-length operands (the relative weight of the housekeeping is higher); with the correction, it is somewhat below it. This difference, however, is noticeable only with benchmark HPL; in the case of the HPCG workload, the computation (including the operand length) has a marginal effect. In the former case, the contributions of computation and communication are in the same range of magnitude, and they compete for dominating the performance of the system. In the latter case, communication dominates the performance; computing has a marginal role. This is the reason why no data are available for the HPCG benchmark using half-precision operands.

Fig. 8 illustrates the role of the non-payload contribution with respect to the operand length. In the figure (for visibility), a hypothetical ratio of 10 between the efficiencies measured by the benchmarks HPL and HPCG was assumed. The non-payload contribution blocks the operation of the floating point unit. The dominating role of the non-payload contribution also means that it is of little importance whether double- or half-precision operands are used in the computation. The blue vectors essentially represent the case of Summit; the red vector represents the corresponding value of Fugaku (transformed to the scaling of Summit). The difference between their HPL performances is attributed to their different F_house values (recall that cache behavior may be included here).
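The marginal role of the operand length under a housekeeping-dominated workload can be illustrated with a few lines of arithmetic. The sketch below uses the time-aware (orthogonal) summing of Eq. (9) with hypothetical contribution values; it illustrates the tendency shown in Fig. 8 and is not a reproduction of the measured data.

import math

def fp16_speedup(f_nonpayload, f_16):
    # Ratio of the FP64 and FP16 execution times under orthogonal summing:
    # approaches 4 when the payload dominates, approaches 1 when the
    # non-payload (housekeeping, communication) contribution dominates.
    time_64 = math.sqrt(f_nonpayload**2 + (4.0 * f_16)**2)
    time_16 = math.sqrt(f_nonpayload**2 + f_16**2)
    return time_64 / time_16

print(fp16_speedup(f_nonpayload=1.0, f_16=1.0))    # HPL-like case: ~2.9
print(fp16_speedup(f_nonpayload=100.0, f_16=1.0))  # HPCG-like case: ~1.001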
Fig. 8 The role of the non-payload contribution in defining the HPL and HPCG efficiencies, for double- and half-precision floating operation modes (timing of the Summit and Fugaku implementations). For visibility, a hypothetical efficiency ratio E_HPL/E_HPCG = 10 is assumed. The housekeeping (including transfer time and cache misses) dominates; the length of the operands has only marginal importance.
This difference, however, gets marginal as the workload approaches real-life conditions.

4.5 Further efficiency values

The performance corresponding to α_HPL^FP0 is slightly above 1 EFlops (when making no floating operations, i.e., rather Eops). Another peak performance, reported when running genomics code on Summit (using a mixture of numerical precisions and mostly non-floating point instructions), is 1.88 Eops, corresponding to a comparable α^Genom value. Given that those two values refer to a different mixture of instructions, the agreement is more than satisfactory.

Our simple calculations show that, in the case of benchmark HPL, the housekeeping contributions are of the same order as the floating-point contributions, and that benchmark HPL is computing bound: reducing the housekeeping (including communication) makes some sense for "racing supercomputers", while for real-life applications it has only a marginal effect. On the other side, increasing the housekeeping (more communication, such as in the case of benchmark HPCG or ANNs) degrades the apparent performance. At a sufficiently large amount of communication [55], the housekeeping dominates the performance, and the contribution of the floating-point operations (of whatever operand length) becomes marginal. For large ANN applications, using shorter operands makes no real difference: their workload defines their performance (and efficiency). The "commodity supercomputers" can achieve the same payload performance although they have a much lower number of PUs: the "racing supercomputers" either cannot use their impressively vast number of cores at all, or using all their cores does not increase their payload performance in solving real-life tasks.

[…] times worse (1−α_eff) and performance gain values. The dominant limiting factor, however, is a different one.

In brain simulation, a 1 ms integration time (essentially a sampling time) is commonly used [26]. The biological time (when the events happen) and the computing time (how much computing time is required to perform the computing operations describing the same events) are not only different, but also not directly related. Working with "signals from the future" must be excluded; for this goal, at the end of this period, the newly calculated values of the state variables must be communicated to all (interested) fellow neurons. This action essentially introduces a "biology clock signal period" that is a million-fold longer than the electronic clock signal period. Correspondingly, the achievable performance is desperately low: orders of magnitude fewer neurons can be simulated than planned [26]. For a detailed discussion see [27, 19, 55].

Figure 9 depicts the experimental equivalent of Figure 5.

Fig. 9 Performance gains of supercomputers modeled as "roofline" [40], as measured with the benchmarks HPL and HPCG (data taken from the TOP500 database [41]), and the one concluded for brain simulation from [26]. The red pentagons denote the performance gain measured using half-precision operands.

In [26], the power consumption efficiency was also investigated. It is presumed that (to avoid unnecessary energy consumption) the measurement was performed at the point where involving more cores increases the power consumption but does not increase the payload simulation performance any more. This assumption resulted in the "reasoned guess" for the efficiency of brain simulation in Figure 5. (Despite its failure, the SpiNNaker2 is also under construction [21].) As using AI workloads on supercomputers is of growing importance (for a discussion from this point of view, see [57]), the performance gain of a processor-based
AI application can be estimated to be between those of HPCG and brain simulation, closer to that of HPCG. As discussed experimentally in [58] and theoretically in [57], in the case of neural networks (especially when an improper layering depth is selected), the efficiency can be much lower than that estimated value. Recall that, since AI nodes usually perform simple calculations compared to the functionality of supercomputer benchmarks, their communication/calculation ratio is much higher, making the efficacy even worse ("artificial intelligence, . . . it's the most disruptive workload from an I/O pattern perspective"). Our conclusions are underpinned by experimental research [58]:

– "strong scaling is stalling after only a few dozen nodes"
– "The scalability stalls when the compute times drop below the communication times, leaving compute units idle. Hence becoming an communication bound problem."
– "network layout has a large impact on the crucial communication/compute ratio: shallow networks with many neurons per layer . . . scale worse than deep networks with less neurons."

The massively "bursty" nature of the data (the different nodes of a layer want to use the communication subsystem at the same moment) also makes the case harder. The commonly used global bus is overloaded with messages (for a detailed discussion see [19]), which may lead to a "communicational collapse" (demonstrated in Figure 5.(a) in [59]): at an extremely large number of cores, exceeding the critical threshold of communication intensity leads to an unexpected and drastic change of the network latency.

Fig. 10 The history of the supercomputer Piz Daint in terms of efficiency and payload performance [50]. The left subfigure shows how the efficiency changed as the developers proceeded towards higher performance (dependence of E_HPL on (1−α_eff) and N, stages 2012/11 through 2018/11). The right subfigure shows the reported performance data (the bubbles), together with the diagram lines calculated from those values as described above (development of R_Max^HPL; configurations: Xeon E5-2670 (2012, 2013), Xeon E5-2670 + NVIDIA K20x (2013), Xeon E5-2690 + NVIDIA Tesla P100 (2016, 2017, 2018)). Compare the value of a diagram line to the measured performance data of the next reported stage.

As the values of the parameters of our model are inferred from non-dedicated, single-shot measurements, their reliability is limited. One can verify, however, how our model predicts values derived from later measurements. Supercomputers usually do not have a long lifespan and several documented stages; one of the rare exceptions is the supercomputer
Piz Daint. Its documented lifetime spans over six years, in which period different numbers of cores, without and with acceleration, and using different accelerators, were used. Figure 10 depicts the performance and efficiency values published during its lifetime, together with the diagram lines predicting (at the time of making the prediction) the values expected at higher nominal performance. The left subfigure shows how the changes made in the configuration affected the efficiency (the timeline starts in the top right corner, and a line connects the adjacent stages).
In the right subfigure, the bubbles represent the data published in adjacent editions of the TOP500 list, and the diagram lines crossing them are the predictions made from those snapshots; a predicted value shall be compared to the value published in the next list. It is especially remarkable that introducing GPGPU acceleration resulted in only a slight increase (in good agreement with [60] and [15]) compared to the value expected purely from the increase in the number of cores. Although more than one parameter was changed between our "samplings", so the net effect of each change cannot be demonstrated in isolation, the measured data sufficiently underpin our (limited-validity) conclusions and show that the theory correctly describes the tendency of the development of performance and efficiency, and even its predicted performance values are reasonably accurate.

Introducing a GPU accelerator is a one-time performance-increasing step [15] and cannot be taken into account by the theory. Notice that introducing the accelerator increased the payload performance but decreased the efficiency (copying data from one address space to another increases the latency). Changing the accelerator to another type with slightly higher performance (but higher latency, due to its larger GPGPU memory) caused a slight decrease in the absolute performance because of the considerably dropped efficiency.
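The prediction procedure behind the diagram lines in Fig. 10 can be sketched in a few lines: from one published (R_Peak, R_Max, N) snapshot the (1−α) value is derived, and from it the efficiency expected at a larger configuration is computed. The first-order Amdahl/Karp–Flatt-like forms assumed below for Eqs. (4)–(5), as well as all numeric inputs, are illustrative placeholders rather than the published Piz Daint data.

def one_minus_alpha_from_snapshot(r_max, r_peak, n_cores):
    # Efficiency of the snapshot and the (1-alpha) value derived from it
    # (assumed Karp-Flatt-like form of Eq. (5)).
    efficiency = r_max / r_peak
    return (1.0 / efficiency - 1.0) / (n_cores - 1.0)

def predict_r_max(r_peak_new, n_cores_new, one_minus_alpha):
    # First-order (Amdahl-like) efficiency at the new configuration
    # (assumed form of Eq. (4)); payload performance = efficiency * R_Peak.
    efficiency_new = 1.0 / (1.0 + (n_cores_new - 1.0) * one_minus_alpha)
    return r_peak_new * efficiency_new

# Usage: derive (1-alpha) from one TOP500 edition and compare the prediction
# with the R_Max published in the next edition (placeholder numbers).
oma = one_minus_alpha_from_snapshot(r_max=1.0e16, r_peak=1.3e16, n_cores=100_000)
print(predict_r_max(r_peak_new=2.6e16, n_cores_new=200_000, one_minus_alpha=oma))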
As detailed above, our theoretical model enables us to calculate the payload performance, in a first-order approximation, at any nominal performance value. In the light of all this, one can estimate the short-term and the longer-term development of supercomputer performance, see Figs. 11.A and 11.B. The diagram lines are calculated using Eq. (4), with the α parameter values derived from the TOP500 data of the Summit supercomputer; the bubbles show the measured values. The diagram lines, from the bottom up, show the double-precision HPCG, the double-precision HPL, and the half-precision [53] HPL (HPL-AI) diagrams. Given that our parameter values are calculated from a snapshot, that the calculation is essentially an extrapolation, and that at high nominal performance values using the second-order approximation becomes more and more pressing, the predictions shown in Figure 11 are rough and very optimistic approximations, although somewhat similar to the real upper limit values.

In addition to the measured and published performance data, two more diagram lines, representing two more calculated α values, are also depicted. The 'FP0' (orange) diagram line is calculated with the assumption that the supercomputer does all the work needed to perform the HPL benchmark, but the actual FP operations are not performed; in other words, the computer works with zero-bit length floating operands (FP0). (The role of α_HPL^FP0 is akin to the execution time of the "empty loop" in programming.) The 'Science' (red) diagram line is calculated with the assumption that nothing is calculated and only science (the finite propagation time due to the finite speed of light) limits the payload performance; here a 100 m cable length was assumed, which would imply an extremely high processor density and a dissipation of some GW.
Fig. 11 Tendency of the development of payload performance in the near and farther future of supercomputing. The parameters of Summit are used for illustration; for comparison, the double-precision performance diagram line of Fugaku is also displayed. The diagram lines are calculated from the theory; the marks are the measured values from the TOP500 database [50] and [53].

The nonlinearity of the payload performance around the Eflops nominal performance is visible, and it depends both on the amount of computing+communication and on the nominal performance (represented by the number of cores). Figure 11.B shows the farther future (in first-order approximation): towards Zflops [7]. No surprise that all payload performance diagram lines run into saturation, even the 'FP0' and 'Science' ones. For comparison, the double-precision payload performance of Fugaku is also displayed. Recall that the diagram lines are calculated in the first-order approximation; in the second-order approximation, it is expected that the diagram lines reach their inflection point and break down. These top supercomputers are near to that point; this is why their development has stalled.
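The saturation seen in Fig. 11 follows directly from the first-order form of the model. Assuming (as above) that Eq. (4) has the Amdahl-like form, with P_single denoting the single-processor performance and N the number of processors, the payload performance is bounded regardless of the nominal performance:

R_Max(N) = N · P_single · E(N),    E(N) = 1 / (1 + (N−1)·(1−α))

lim_{N→∞} R_Max(N) = P_single / (1−α)

That is, under this assumed first-order form, every workload (every α) defines its own saturation level, which no increase in the number of cores can surpass.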
The analogies in this section are not meant to imply a direct correspondence between certain physical and computing phenomena. Instead, we want to draw attention both to the fact that under extreme conditions qualitatively different behavior may be encountered, and to the fact that scrutinizing certain formerly unnoticed or neglected aspects enables us to explain the new phenomena. Unlike in nature, in computing the technical implementation of the critical points can be changed, and through this the behavior of computing systems can also be altered. No systematic discussion is given here; for a detailed discussion and more details, see [4].
The performance gain of parallelized sequential processing cannot exceed 1/(1−α), a well-known consequence of Amdahl's statement. Due to this, the computing performance of a system cannot be increased above the performance defined by the single-processor performance, the parallelization technology, and the number of processors. Laws of nature prohibit exceeding a specific computing performance (using the classical paradigm and its classical implementation). There is an analogy between adding speeds in physics and adding performances in computing.
In both cases, a correction term is introduced that provides a noticeable effect only at extremely large values. It seems to be an interesting parallel that both nature and extreme-scale technical (computing) systems show some extraordinary behavior: the linear behavior experienced under normal conditions gets strongly non-linear at large values of the independent variable. That behavior makes the validity of linear extrapolations, as well as of the linear addition of performances, at least questionable at high performance values in computing.

7.2 Quantal nature of computing time

In computing, the cooperation of components needs some kind of synchronization. Typically, a central clock (with clock domains and distribution trees) is used. In today's technology, the length of the clock cycle is in the ns range, so it appears quasi-continuous to human perception. In some applications, for example when attempting to imitate the parallel operation of the nervous system, the non-continuous nature of computing time comes to light. In that case, the independently running neurons must be put back onto their biological time scale: they stop at the end of the "integration grid time" periods and distribute their results to the peer neurons. This introduces a "biological clock time", which is about a million-fold longer than the "processor clock time", and it limits the achievable parallelization gain. The effect is noticeable only at a vast number of cores. The limit is conceptual, but it is made much worse by the "communication burst" that the simultaneous need for communication in neural networks represents. For details see [27].

7.3 'Quantum states' of supercomputers

As discussed in detail in [4], the behavior of computing systems under extreme conditions shows surprising parallels with the behavior of natural objects. Really, "More Is Different" [28]. The behavior of supercomputers is somewhat analogous to the behavior of quantum systems, where the measurement selects one of the possible states (and, at the same time, kills all other possible states). In computing, a supercomputer –as a general-purpose computing system– has the potential of high performance, defined by the impressive parameters of its components. However, when we run a computation (that is, we measure the computing performance of our computer), that workload selects the best possible combination of limitations that defines the performance, and it kills all other potential performances.

The logical dependence of the operation of the components implicitly also means their temporal dependence [19], and it introduces idle times into computing. In this way, the workload defines how much of those potential abilities can be used: the datasheet values represent the hard limits, and the workload sets the soft limits, given that the components block each other's operation. The workload defines a fill-out factor: it introduces different idle times into the operation of the components, and in this way it forces workload-defined soft performance limits onto the components of the supercomputer. Different workloads force different limitations (use the available resources differently), giving a natural explanation of their "different efficiencies" [14]. In other words, running some computation destroys the potentially achievable high performance, defined individually by the components.

Benchmarking such computing systems introduces one more limiting component: the needed computation.
For floating-point computations, the 'best possible' (producing the highest figures of merit) benchmark is HPL. With the development of parallelization and processor technology, the floating computation itself became the major contributor in defining the efficiency and performance of the system. Since the benchmark measurement method itself is a computation, the best measurable floating payload performance is limited by the value that the benchmark procedure itself represents.

For real-life programs (such as HPCG), the workload-defined performance level (saturation value) is already set well below the Eflops nominal performance, see Fig. 2. Further enhancements in technology, such as tensor processors and the OpenCAPI connection bus, can slightly increase their saturation level but cannot change the science-defined shape of the diagram line.
Supercomputers have reached their technical limitations; their development has run out of steam. Continuing to enhance the components of a supercomputer that is intended to run any calculation, without changing the underlying computing paradigm, is not worthwhile any more. To enter the "next level", really renewing the classic computing paradigm is needed [61–63, 19].

The ironic remark that 'Perhaps supercomputers should just be required to have written in small letters at the bottom on their shiny cabinets: "Object manipulations in this supercomputer run slower than they appear." [14]' is becoming increasingly relevant. The impressive numbers about the performance of the components (including single-processor and/or GPU performance and the speed of the interconnection) become less relevant when going to the extremes. Given that the most substantial α contribution today originates in the computation the supercomputer runs, even the best possible benchmark, HPL, dominates the floating performance measurement. Enhancing other contributions, such as the interconnection, results in a marginal enhancement of the payload performance; that is, the overwhelming majority of the expenses increases only the "dark performance". Because of this, the answers to the questions in the title are: there are as many performance values as there are measurement methods (and these can be further varied by how big a portion of the available cores is used in the measurement), and the benchmarks actually measure mainly how much mathematics/communication the benchmark program does, rather than the supercomputer architecture (provided that all components deliver their technically achievable best parameters).

References
1. S. H. Fuller and L. I. Millett, Eds., The Future of Computing Performance: Game Over or Next Level? National Academies Press, Washington, 2011.
2. G. M. Amdahl, "Validity of the Single Processor Approach to Achieving Large-Scale Computing Capabilities," in AFIPS Conference Proceedings, vol. 30, 1967, pp. 483–485.
3. J. P. Singh, J. L. Hennessy, and A. Gupta, "Scaling parallel programs for multiprocessors: Methodology and examples," Computer, vol. 26, no. 7, pp. 42–50, Jul. 1993.
4. J. Végh and A. Tisan, "The need for modern computing paradigm: Science applied to computing," in Computational Science and Computational Intelligence CSCI; The 25th Int'l Conf on Parallel and Distributed Processing Techniques and Applications. IEEE, 2019, pp. 1523–1532. [Online]. Available: http://arxiv.org/abs/1908.02651
5. J. Végh, "The performance wall of parallelized sequential computing: the roofline of supercomputer performance gain," Parallel Computing, vol. in review, http://arxiv.org/abs/1908.02280, 2019.
6. I. Markov, "Limits on fundamental limits to computation," Nature, vol. 512(7513), pp. 147–154, 2014.
7. Liao, Xiang-ke et al., "Moving from exascale to zettascale computing: challenges and techniques," Frontiers of Information Technology & Electronic Engineering.
Science
Nature, vol. 551, pp. 554–556, 2017.
14. IEEE Spectrum, "Two Different Top500 Supercomputing Benchmarks Show Two Different Top Supercomputers," https://spectrum.ieee.org/tech-talk/computing/hardware/two-different-top500-supercomputing-benchmarks-show-two-different-top-supercomputers, 2017.
15. H. Simon, "Why we need Exascale and why we won't get there by 2020," in Exascale Radioastronomy Meeting.
Commun. ACM, vol. 31, no. 5, pp. 532–533, May 1988.
17. S. Krishnaprasad, "Uses and Abuses of Amdahl's Law," J. Comput. Sci. Coll.
Frontiers in Neuroscience
Frontiers in Neuroinformatics, vol. 8, p. 78, 2014.
25. S. B. Furber, D. R. Lester, L. A. Plana, J. D. Garside, E. Painkras, S. Temple, and A. D. Brown, "Overview of the SpiNNaker System Architecture," IEEE Transactions on Computers, vol. 62, no. 12, pp. 2454–2467, 2013.
26. S. J. van Albada, A. G. Rowley, J. Senk, M. Hopkins, M. Schmidt, A. B. Stokes, D. R. Lester, M. Diesmann, and S. B. Furber, "Performance Comparison of the Digital Neuromorphic Hardware SpiNNaker and the Neural Network Simulation Software NEST for a Full-Scale Cortical Microcircuit Model," Frontiers in Neuroscience, vol. 12, p. 291, 2018.
27. J. Végh, "How Amdahl's Law limits performance of large artificial neural networks," Brain Informatics, vol. 6, pp. 1–11, 2019. [Online]. Available: https://braininformatics.springeropen.com/articles/10.1186/s40708-019-0097-2/metrics
28. P. W. Anderson, "More Is Different," Science, vol. 177, pp. 393–396, 1972.
29. D. Patterson and J. Hennessy, Eds., Computer Organization and Design. RISC-V Edition. Morgan Kaufmann, 2017.
30. R. Waser, Ed., Advanced Electronics Materials and Novel Devices, ser. Nanoelectronics and Information Technology. Wiley, 2012.
31. K. Hwang and N. Jotwani, Advanced Computer Architecture: Parallelism, Scalability, Programmability, 3rd ed. Mc Graw Hill, 2016.
32. V. Weaver, D. Terpstra, and S. Moore, "Non-determinism and overcount on modern hardware performance counter implementations," in Performance Analysis of Systems and Software (ISPASS), 2013 IEEE International Symposium on, April 2013, pp. 215–224.
33. P. Molnár and J. Végh, "Measuring Performance of Processor Instructions and Operating System Services in Soft Processor Based Systems," in , 2017, pp. 381–387.
34. F. Ellen, D. Hendler, and N. Shavit, "On the Inherent Sequentiality of Concurrent Objects," SIAM J. Comput., vol. 43, no. 3, pp. 519–536, 2012.
35. L. Yavits, A. Morad, and R. Ginosar, "The effect of communication and synchronization on Amdahl's law in multicore systems," Parallel Computing, vol. 40, no. 1, pp. 1–16, 2014.
36. J. Végh and P. Molnár, "How to measure perfectness of parallelization in hardware/software systems," in , 2017, pp. 394–399.
37. F. Zheng, H.-L. Li, H. Lv, F. Guo, X.-H. Xu, and X.-H. Xie, "Cooperative computing techniques for a deeply fused and heterogeneous many-core processor architecture," Journal of Computer Science and Technology, vol. 30, no. 1, pp. 145–162, Jan 2015.
38. M. Mohammadi and T. Bazhirov, "Comparative Benchmarking of Cloud Computing Vendors with High Performance Linpack," in Proceedings of the 2nd International Conference on High Performance Compilation, Computing and Communications, ser. HP3C. New York, NY, USA: ACM, 2018, pp. 1–5.
39. A. H. Karp and H. P. Flatt, "Measuring Parallel Processor Performance," Commun. ACM, vol. 33, no. 5, pp. 539–543, May 1990.
40. S. Williams, A. Waterman, and D. Patterson, "Roofline: An insightful visual performance model for multicore architectures," Commun. ACM
Proceedings of the 3rd ACM/SPEC International Conference on Performance Engineering
Proceedings of the 2007 Workshop on Experimental Computer Science, ser. ExpCS '07. New York, NY, USA: ACM, 2007, pp. 3–3.
46. F. M. David, J. C. Carlyle, and R. H. Campbell, "Context Switch Overheads for Linux on ARM Platforms," in Proceedings of the 2007 Workshop on Experimental Computer Science, ser. ExpCS '07. New York, NY, USA: ACM, 2007. [Online]. Available: http://doi.acm.org/10.1145/1281700.1281703
47. J. Végh, J. Vásárhelyi and D. Drótos, "The performance wall of large parallel computing systems," in Lecture Notes in Networks and Systems 68. Springer, 2019, pp. 224–237. [Online]. Available: https://link.springer.com/chapter/10.1007%2F978-3-030-12450-2_21
48. J. Végh, "How Amdahl's law restricts supercomputer applications and building ever bigger supercomputers," CoRR, vol. abs/1708.01462, 2018. [Online]. Available: http://arxiv.org/abs/1708.01462
49. T. Ippen, J. M. Eppler, H. E. Plesser, and M. Diesmann, "Constructing Neuronal Network Models in Massively Parallel Environments," Frontiers in Neuroinformatics
The International Journal of High Performance Computing Applications
Proceedings of the International Conference for High Performance Computing, Networking, Storage, and Analysis, ser. SC '18. IEEE Press, 2018, pp. 47:1–47:11.
54. Y. Ao, C. Yang, F. Liu, W. Yin, L. Jiang, and Q. Sun, "Performance Optimization of the HPCG Benchmark on the Sunway TaihuLight Supercomputer," ACM Trans. Archit. Code Optim., vol. 15, no. 1, pp. 11:1–11:20, Mar. 2018.
55. J. Végh, "Which scaling rule applies to Artificial Neural Networks," in Computational Intelligence (CSCE) The 22nd Int'l Conf on Artificial Intelligence (ICAI'20)
How deep machine learning can be, ser. A Closer Look at Convolutional Neural Networks. Nova, in press, 2020, pp. 141–169. [Online]. Available: https://arxiv.org/abs/2005.00872
58. J. Keuper and F.-J. Preundt, "Distributed Training of Deep Neural Networks: Theoretical and Practical Limits of Parallel Scalability," in
Journal of Physics D: Applied Physics, vol. 52, no. 1, p. 014003, Oct. 2018.
60. V. W. Lee, C. Kim, J. Chhugani, M. Deisher, D. Kim, A. D. Nguyen, N. Satish, M. Smelyanskiy, S. Chennupaty, P. Hammarlund, R. Singhal, and P. Dubey, "Debunking the 100X GPU vs. CPU Myth: An Evaluation of Throughput Computing on CPU and GPU," in Proceedings of the 37th Annual International Symposium on Computer Architecture, ser. ISCA '10. New York, NY, USA: ACM, 2010, pp. 451–460. [Online]. Available: http://doi.acm.org/10.1145/1815961.1816021
61. J. Végh, Renewing computing paradigms for more efficient parallelization of single-threads, ser. Advances in Parallel Computing. IOS Press, 2018, vol. 29, ch. 13, pp. 305–330. [Online]. Available: https://arxiv.org/abs/1803.04784
62. J. Végh, "Introducing the Explicitly Many-Processor Approach," Parallel Computing, vol. 75, pp. 28–40, 2018.
63. ——, "How to extend the Single-Processor Paradigm to the Explicitly Many-Processor Approach," in 2020 CSCE, Fundamentals of Computing Science.