A figure of merit for describing the performance of scaling of parallelization
János Végh a, Péter Molnár b, József Vásárhelyi a

a University of Miskolc, Hungary
b PhD School of Informatics, University of Debrecen, Hungary

Email addresses: [email protected] (János Végh), [email protected] (Péter Molnár), [email protected] (József Vásárhelyi)

Preprint submitted to JPDC, July 7, 2018
Abstract
With the spread of multi- and many-core processors, an increasingly typical task is to re-implement some source code, written originally for a single processor, to run on more than one core. Since this is a serious investment, it is important to decide how much effort pays off, and whether the resulting implementation performs as well as it could. Amdahl's law provides some theoretical upper limits for the performance gain reachable through parallelizing the code, but it requires detailed architectural knowledge of the program code, does not consider the housekeeping activity needed for parallelization, and cannot tell how the actual stage of a parallelization implementation performs. The present paper suggests a quantitative measure for that goal. This figure of merit is derived experimentally, from measured running times and the number of threads/cores. It can be used to quantify the parallelization technology used, the connection between the computing units, the acceleration technology under the given conditions, or the performance of the software team/compiler.
Keywords: multi-core, parallelization, performance, scaling, figure of merit
1. Introduction
Computer manufacturing technology is no longer able to produce quicker processors, see Agarwal et al. (2000). The crisis of computing
experienced since ca. 2000, see Fuller and Millett (2011), increased the demand for parallel computing. On the hardware side:
"Processor and network architectures are making rapid progress with more and more cores being integrated into single processors and more and more machines getting connected with increasing bandwidth. Processors become heterogeneous and reconfigurable.", see S(o)OS project (2010). On the software side: "parallel programs ... are notoriously difficult to write, test, analyze, debug, and verify, much more so than the sequential versions", see Yang et al. (2014). In addition, typical real-life programs show complex parallelization behavior, see Yavits et al. (2014), and even apparently massively parallel algorithms can behave extremely ineffectively, see Pingali et al. (2011). Different merits have been derived for characterizing parallel systems, from simple speedup to price per performance, see Karp and Flatt (1990).

Because of all these difficulties, demands, and possibilities, countless developments are running or planned to re-implement existing code, written with a single processor in mind, for the already ubiquitous multi-core processors. To find out whether it is worth investing in such parallelization efforts, as well as to decide where to stop the development, one needs some quantitative measure of parallelism. Even today, in the "multicore era", see Hill and Marty (2008), performance is described by (a modified version of) Amdahl's law, see Amdahl, G. M. (1967). Unfortunately, Amdahl's law provides no support for the goals mentioned: it needs information on the code architecture, and it makes assumptions that are no longer valid for modern accelerated processors and for applications heavily using operating system services. However, by scrutinizing its conditions and inverting its formula, it is possible to derive such a figure of merit.
2. Limitations of parallelization
Amdahl's considerations focus on the fact that some parts (P_i) of a code can be parallelized, while some (S_i) must remain sequential. He also mentioned that data housekeeping causes some overhead, which in his paper was estimated to be in the range of 20% to 40%, and that the nature of this overhead appears to be sequential. Although Amdahl just wanted to draw attention to the limitations of the single-processor approach applied to large-scale computing, his followers turned his considerations into the formula now known as Amdahl's law. The formula (implicitly) assumes that
• the parallelized parts are of equal length in terms of execution time,
• the housekeeping (controlling the parallelization, passing parameters, exchanging messages, etc.) has no cost in terms of execution time,
• the number of parallelizable chunks coincides with the number of available computing resources.

Essentially, this is why Amdahl's law represents a theoretical upper limit for the parallelization gain.

[Figure 1: Timing diagrams of parallelization (time in arb. units). A) The classic Amdahl case: the serial process comprises sequential-only parts S_i and parallelizable parts P_i; T_total = Σ_i S_i + Σ_i P_i and α = Σ_i P_i / (Σ_i S_i + Σ_i P_i). B) The realistic Amdahl case: control components C_i appear and the parallelized parts have different lengths; T_total = Σ_i S_i + Σ_i C_i + P_max, and α_eff is derived from the measured speedup.]

In Fig. 1 the left side shows the idealistic case, where the original process in the single-processor system comprises the sequential-only parts S_i and the parallelizable parts P_i. One can also see that the control components C_i are of the same nature as S_i, the non-parallelizable components. This also means that even in the idealistic case, when the S_i are negligible, the C_i will represent a bound for parallelization. From the figure, the meaning of α is: in what fraction of time (in terms of the time needed for completely sequential processing) the helper processors are utilized. When the task is scaled to several processors, the goal is obviously to maximize the utilization of the helper processors.

The realistic case (shown on the right side of Fig. 1), however, is that the parallelized parts are not of equal length (even if they contain exactly the same instructions, the hardware operation in modern processors may execute them in considerably different times; for examples see the operation of hardware accelerators inside a core, or the network operation between processors), and also that the time required to control the parallelization is not negligible and varies. Here α_eff provides a value for the average utilization of the helper cores. Obviously, the unused cores and the unbalanced load of the cores degrade this average utilization. To characterize effects like sharing the processing between different numbers of helper cores (the performance of the scaling) or using different hardware conditions, one needs a quantitative figure of merit.

The figure also calls attention to the fact that a static correspondence between program chunks and processing units can be very inefficient: all assigned processing units must wait for the delayed unit, and capacity is also lost if the number of computing resources exceeds the number of parallelized chunks. Usually, Amdahl's law is expressed with the formula

$$S^{-1} = (1 - \alpha) + \alpha/k \qquad (1)$$

where k is the number of parallelized code fragments, α is the parallelizable fraction of the total (sequential) execution time, and S is the measurable speedup. The assumption can be visualized as follows: (assuming many processing units) in an α fraction of the running time the processors are processing, and in a (1 − α) fraction they are waiting. That is, α describes how much, on average, the processors are utilized. Having those data, the resulting speedup can be estimated.

For a system under test, where α is not a priori known, one can derive from the measurable speedup S an effective parallelization factor as

$$\alpha_{eff} = \frac{k}{k-1} \cdot \frac{S-1}{S} \qquad (2)$$

where S is now the measured speedup and k is the number of available cores. Obviously, for the classical case α = α_eff, which simply means that in the idealistic case the actually measurable effective parallelization reaches the theoretically possible one. In other words, α describes a system whose architecture is completely known, while α_eff describes a system whose performance is known from experiments.
Again in other words, α is the theoretical upper limit, which can hardly be reached, while α_eff is the experimental actual value, which describes the complex architecture and the actual conditions. It is interesting to note that α_eff is an absolute measure of utilizing the available processing capacity, see section 3. Numerically, (1 − α_eff) equals the f value established theoretically by Karp and Flatt (1990).

α_eff can then be used to refer back to Amdahl's classical assumption even in the realistic case, when the parallelized chunks have different lengths and the overhead of organizing the parallelization is not negligible. Note that in the case of real tasks a kind of Sequential/Parallel Execution Model, see Yavits et al. (2014), shall be applied, which cannot use the simple picture reflected by α; α_eff, however, gives a good merit of the degree of parallelization for the duration of the execution of the process, and can be compared to the results of the technology-dependent parametrized formulas.

With our notations, in the classical Amdahl case on the left side of Fig. 1,

$$S = \frac{\sum_i S_i + \sum_i P_i}{\sum_i S_i + \max_i P_i} = 2 \qquad (3)$$

and

$$\alpha = \alpha_{eff} = \frac{\sum_i P_i}{\sum_i S_i + \sum_i P_i} = 3/4 \qquad (4)$$

In the realistic case on the right side, the measured speedup is only S = 10/7, which results in

$$\alpha_{eff} = \frac{3}{2} \cdot \frac{10/7 - 1}{10/7} = 0.45 \qquad (5)$$

As seen, the overhead and the different durations of the parallelized parts reduced the effective parallelization drastically relative to the theoretically reachable value.
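To make the inverse use of Amdahl's law concrete, the following minimal sketch (our own illustration, not part of the original derivation; function names are ours) implements Eq. (2) and reproduces the numbers of the worked example above.

```python
def alpha_eff(k: int, speedup: float) -> float:
    """Effective parallelization factor, Eq. (2):
    alpha_eff = k/(k-1) * (S-1)/S,
    where S is the measured speedup and k the number of cores."""
    if k < 2 or speedup <= 0:
        raise ValueError("need k >= 2 and a positive speedup")
    return (k / (k - 1)) * (speedup - 1.0) / speedup

# Classic Amdahl case of Fig. 1 (k = 3 equal chunks, S = 2), Eq. (4):
print(alpha_eff(3, 2.0))        # -> 0.75
# Realistic case of Fig. 1 (overhead, unequal chunks, S = 10/7), Eq. (5):
print(alpha_eff(3, 10 / 7))     # -> 0.45
```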
Fig. 2 gives a feeling for the effect of the computer system's behavior on the effective parallelization. The middle region (marked by balls) is the range mentioned by Amdahl as the typical range of overhead. The asterisk in the figure shows the "working point" corresponding to the values used in Fig. 1.

[Figure 2: Behavior of the effective parallelization α_eff as a function of the overhead ratio (relative to the parallelizable payload execution length) and of the ratio of the sequential part (relative to the total sequential execution time), for k = 3.]

One can see that the effective parallelization drops quickly with both increasing overhead and increasing sequential-only parts of the program. This draws attention to the idea that by decreasing either the control time or the sequential-only fraction of the code (or both), and by utilizing the otherwise wasted processing capacity, a serious gain in the effective parallelization can be reached. This was experienced with the "dynamic" architecture of Hill and Marty (2008).

The timing analysis in Fig. 1 can be applied to different kinds of parallelization: from processor-level parallelization (instruction- or data-level parallelization, in the nanoseconds range), through OS-level parallelization (including thread-level parallelization using several processors or cores, in the microseconds range), to network-level parallelization (between networked computers, like grids, in the milliseconds range). The principles are the same, see David et al. (2013), independently of the kind of implementation. In agreement with Yavits et al. (2014), housekeeping overhead is always present (and mainly depends on the architectural solution) and remains a key question, although the main focus is always on reducing its effect.

The actual speedup (or the effective parallelization) depends strongly on the "tricks" used during implementation. Although HW and SW parallelism are interpreted differently, they can even be combined, see Chandy and Singaraju (2009), resulting in hybrid architectures. For such greatly different architectural solutions it is even harder to interpret α, while α_eff allows comparing different implementations (or the same implementation under different conditions) in such cases, too.

Notice that in all kinds of parallelization the relative overhead follows the observation made by Amdahl: for good performability, the overhead time cannot exceed a few dozen percent of the execution time of the parallelized chunk. Between networked computers the control times C_i are in the millisecond range and a lot of data must be transmitted, so to make the parallelization efficient, typically complete jobs (with durations sometimes even in the minutes range) are sent to the other computers. When using thread-level parallelism through OS services inside a computer, the "expenses" of organizing threads are in the order of thousands of instructions, and typically the length of the working threads is at least in that order or above. In hardware-level parallelization, e.g. when using hyperthreading, the values of C_i are in the range of one clock cycle; however, the program chunks P_k are also in the order of a few clock cycles, i.e. in a comparable range. The same holds for speculative evaluation, out-of-order evaluation, etc.
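The trend shown in Fig. 2 can be reproduced with a simple timing model. The sketch below is our own illustration under stated assumptions, not the authors' exact model: the total sequential execution time is normalized to unity, a fraction seq of it is sequential-only, the payload (1 − seq) is cut into k equal chunks, and a control overhead proportional to the payload (ratio c) is executed sequentially.

```python
def alpha_eff(k: int, speedup: float) -> float:
    # Eq. (2): effective parallelization from a measured speedup
    return (k / (k - 1)) * (speedup - 1.0) / speedup

def modeled_alpha_eff(seq: float, c: float, k: int = 3) -> float:
    """alpha_eff under an assumed timing model: parallel execution
    time = sequential part + sequential overhead + payload / k."""
    payload = 1.0 - seq
    t_parallel = seq + c * payload + payload / k
    return alpha_eff(k, 1.0 / t_parallel)

# alpha_eff drops quickly with both the overhead ratio and the
# sequential ratio, as in Fig. 2 (0.2-0.4 is Amdahl's typical overhead):
for seq in (0.01, 0.10, 0.25):
    for c in (0.0, 0.2, 0.4):
        print(f"seq={seq:.2f} c={c:.1f} -> {modeled_alpha_eff(seq, c):.3f}")
```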
3. Practical applications
Which metric best describes parallelism depends on many factors, see Karp and Flatt (1990). The newly introduced metric α_eff describes how effectively the computing task is distributed between the processing units. As outlined above, the control functionality, as well as inequalities in cutting the program into equally long pieces (including the data transfer between processing units), degrade α_eff. Since 1 − α gives the sequential-only part of the program, (1 − α_eff) is expected to describe the ratio of the total (even unintended) sequential part, i.e. it is a sensitive measure of disturbances of the parallelization. Since a larger load imbalance results in a larger decrease in the value of α_eff (as can be concluded from Karp and Flatt (1990), α_eff is a kind of derivative of the relative speedup), problems can be identified better than from speedup values. If the parallelization is well organized (load balanced, small overhead, right number of processors), α_eff saturates at unity, so tendencies can be displayed better by plotting (1 − α_eff).
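As a sketch of this diagnostic use (the speedup values below are invented solely for illustration): two runs may show barely distinguishable speedup curves, while the corresponding (1 − α_eff) curves separate them clearly, one staying flat and the other growing with the number of cores.

```python
def one_minus_alpha_eff(k: int, speedup: float) -> float:
    # The total apparent sequential fraction, from Eq. (2)
    return 1.0 - (k / (k - 1)) * (speedup - 1.0) / speedup

cores = [2, 4, 6, 8]
balanced  = [1.98, 3.88, 5.70, 7.44]   # hypothetical measured speedups
disturbed = [1.96, 3.77, 5.42, 6.91]   # same, with a mild load imbalance

for k, s_b, s_d in zip(cores, balanced, disturbed):
    print(f"k={k}: S {s_b:.2f}/{s_d:.2f}  "
          f"1-alpha_eff {one_minus_alpha_eff(k, s_b):.4f}/"
          f"{one_minus_alpha_eff(k, s_d):.4f}")
```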
[Figure 3: Relative speedup (left side) and (1 − α_eff) (right side) values, measured running the audio and radar processing (Audio stream 1, Audio stream 2, Radar initial, Radar improved) on different numbers of cores. Data from Sheng et al. (2014).]

The paper Sheng et al. (2014) describes a compiler that makes an effective parallelization of an originally sequential code for different numbers of cores, validated by running the executable code on platforms having the corresponding number of cores. Let us apply Eq. (2) to their results, shown in Figs. 8 and 10 of their paper.

The left side of Fig. 3 displays the efficiency (speedup divided by the number of cores) as a function of the number of cores, for two different processings of audio streams and for two processings of radar signals. The data displayed in the figures were derived simply by reading back the values from the mentioned diagrams in Sheng et al. (2014), so they may not be accurate. However, they are accurate enough to support our conclusions.

Based on their merit, the authors of Sheng et al. (2014) can only make a qualitative statement, namely that the "efficiency" decreases less steeply (dots in the figure) with the growing number of cores ("The higher number of parallel processes in Audio-2 gives better results") when they consider load balancing. It can surely be stated that the improvement was successful: in both cases the decrease with increasing number of cores is less steep. The diagrams cannot tell, however, whether further improvements are possible, or whether the parallelization is uniform as a function of the number of cores.
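The conversion used for the right side of Fig. 3 is straightforward: the read-back efficiency and the core count give the speedup (S = efficiency · k), and Eq. (2) then gives (1 − α_eff). A minimal sketch, with placeholder efficiency values standing in for the actual read-back data:

```python
def one_minus_alpha_eff_from_efficiency(k: int, efficiency: float) -> float:
    """Convert a (core count, efficiency) pair into (1 - alpha_eff);
    efficiency = S/k, so S = efficiency * k, then Eq. (2) applies."""
    speedup = efficiency * k
    return 1.0 - (k / (k - 1)) * (speedup - 1.0) / speedup

# Placeholder readings (k, efficiency), not the values of Sheng et al.:
for k, eff in [(2, 0.95), (3, 0.93), (4, 0.90)]:
    print(k, round(one_minus_alpha_eff_from_efficiency(k, eff), 4))
```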
In contrast, the (1 − α_eff) diagrams (right side) also show that in both cases the improvement decreased the sequential part, i.e. improved the parallelization. It can also be seen that in the case of the audio streams the parallelization improved, and so did the uniformity of the parallelization. In the case of the radar signals, without optimization the parallelization decreases as the number of cores increases; with the load balancing option on, the parallelization gets better at any core number. The compiler really does a good job: α_eff is practically constant, i.e. the compiler finds nearly all possibilities. Note that the absolute values in the two cases must not be compared: they represent the sequential-only parts of the two programs, which might be different for the different programs. The uniformity of the values also makes it highly probable that in the case of the audio streams further optimization can be done, at least for the 2-core and 3-core systems, while the processing of the radar signals has reached its bounds. In addition, the magnitude of the non-parallelizable part can also be estimated from the diagrams.

S and α_eff are simply two different points of view of the same thing. If we have the information how big the α fraction of the code is that can be executed in parallel (an architectural point of view), we can estimate the maximum speedup we can reach. Here we assume that all processors have the same architecture. The experimentalist's point of view is different: if we can measure the speedup and know how many processing units were used, we can estimate how big an α_eff fraction was running in parallel (assuming the mentioned ideal conditions).
Notice also that in deriving α_eff no assumption was made on the code architecture, on the nature of the computing units, or on their way of linking, so the merit can be used to characterize the effect of a change, if one of the mentioned components changes. This means one can use this merit to describe the effect of changing the code architecture, the (behavior of the) interconnecting network, the (internal architecture of the) hardware setup, etc. as well.

[Figure 4: Relative speedup (left side) and (1 − α_eff) (right side) values, measured running Linpack on different computers (Cray Y-MP/8, IBM-3090, Alliant FX/80) with different numbers of parallel processors. Data from Karp and Flatt (1990).]

In Table I of Karp and Flatt (1990), different architectures are compared, running the same program (Linpack) on computers from different manufacturers and having different numbers of processors. Because the subject of that paper was deriving a metric from measured data, here the precision of the values is much better. The high degree of parallelization results in α_eff values close to unity, so the value (1 − α_eff) is used in Fig. 4. As in the previous case, the efficiency decreases with the increasing number of cores. The effective parallelization is nearly constant, and the differences in the absolute values can be attributed to implementation details of the different computers.

[Figure 5: Relative speedup (left side) and (1 − α_eff) (right side) values, measured running different algorithms (Wave Motion, Fluid Dynamics, Beam Stress) on the same computer with different numbers of parallel processors. Data from Karp and Flatt (1990).]

[Figure 6: (1 − α_eff) values, measured when minimizing the Rosenbrock function (left side) and the Rastrigin function (right side) on the same SoC, using different communication strategies (Ring, Neighbourhood, Broadcast), as a function of the processors used. Data from de Macedo Mourelle et al. (2016).]

In their work, de Macedo Mourelle et al. (2016) compare different communication strategies used by their PSO when minimizing the Rosenbrock function and the Rastrigin function, respectively. As could be expected, in the case of the "broadcast" type communication the "sequential" fraction increases with the number of processors, while in the other cases it remains practically constant. The fluctuation shows the limitations of the (otherwise excellent) measurement precision.

It is worth comparing Fig. 6 with the right side of Fig. 5. All three diagrams show the scaling behavior of some procedure as a function of the processing units. It would be worthwhile to run the processes shown in Fig. 5 in the PSO, to find out the advantage of having the processing units inside the chip.
In Table II of Karp and Flatt (1990), the execution times of different programs are given as a function of the number of processors. The data are shown in Fig. 5. As presented there, the efficiency drops in a catastrophic way as the number of cores increases, while (1 − α_eff) changes only within the limits of the measurement error. Notice that Figs. 3, 4 and 5 use the same scale, and that a steeper decrease of efficiency means higher values of (1 − α_eff).

The behavior of the efficiency deserves some analysis. As detailed at the beginning of section 2, the distinguished constituent in Amdahl's classic analysis is the parallelizable fraction α; all the rest (including wait time, non-payload activity, etc.) goes into the "sequential-only" fraction. When using several processors, one of them makes the sequential calculation while the others are waiting (using the same amount of time). So, when calculating the speedup, one calculates

$$S = \frac{(1-\alpha) + \alpha}{(1-\alpha) + \alpha/k} = \frac{k}{k(1-\alpha) + \alpha} \qquad (6)$$

hence the efficiency

$$\frac{S}{k} = \frac{1}{k(1-\alpha) + \alpha} \qquad (7)$$

This explains the behavior of the S/k diagrams as a function of k in the figures above: the more processors, the lower the efficiency; and the larger (1 − α) is, the fewer processors it takes for the efficiency to decrease noticeably, which is why the efficiency decays so differently in Fig. 5 and in Fig. 3. This is why Amdahl made his very reasonable conclusion: "the effort expended on achieving high parallel processing rates is wasted unless it is accompanied by achievements in sequential processing rates of very nearly the same magnitude", Amdahl, G. M. (1967).
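A quick numeric illustration of Eq. (7), with arbitrary example values of α chosen by us: even with a 99% parallelizable fraction the efficiency halves at about one hundred processors, while with 99.9% the same drop only happens around a thousand.

```python
def efficiency(alpha: float, k: int) -> float:
    # Eq. (7): S/k = 1 / (k*(1 - alpha) + alpha)
    return 1.0 / (k * (1.0 - alpha) + alpha)

for alpha in (0.99, 0.999):
    for k in (10, 100, 1000):
        print(f"alpha={alpha}: k={k:4d} -> efficiency {efficiency(alpha, k):.3f}")
```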
4. Conclusions
With the spread of both multi-core architectures using different parallelization solutions (like different networking or reconfigurable connection of cores, etc.) and of parallelizing formerly sequential code, either through programmers' effort or by parallelizing compilers like the one by Sheng et al. (2014), it becomes a more and more important problem to characterize quantitatively the performance of the parallelization. By inverting the formula known as Amdahl's law and re-interpreting the quantities it comprises, such a figure of merit was derived. This experimental quantity correctly describes the performance of the parallelization, allowing one to characterize the performance of programmers or parallelizing compilers (see Fig. 3), different architectural solutions with many processors (see Fig. 4), and different algorithms as a function of the number of processors (see Fig. 5), as well as to describe the performance of the network connection while running the task, or to quantify the synchronization method used between the computing units. The introduced merit seems to be an adequate measure of the performance of the technology used for parallelization, unlike the formerly used quantity (speedup divided by the number of computing units).
References
Agarwal, V., Hrishikesh, M., Keckler, S., Burger, D., 2000. Clock Rate versus IPC: The End of the Road for Conventional Microarchitectures. In: Proceedings of the 27th Annual International Symposium on Computer Architecture.

Amdahl, G. M., 1967. Validity of the Single Processor Approach to Achieving Large-Scale Computing Capabilities. In: AFIPS Conference Proceedings. Vol. 30. pp. 483–485.

Chandy, J. A., Singaraju, J., 2009. Hardware parallelism vs. software parallelism. In: First USENIX Workshop on Hot Topics in Parallelism (HotPar '09). URL http://static.usenix.org/event/hotpar09/tech/full_papers/chandy/chandy_html
David, T., Guerraoui, R., Trigonakis, V., 2013. Everything you always wanted to know about synchronization but were afraid to ask. In: Proceedings of the Twenty-Fourth ACM Symposium on Operating Systems Principles (SOSP '13). pp. 33–48.

de Macedo Mourelle, L., Nedjah, N., Pessanha, F. G., 2016. Reconfigurable and Adaptive Computing: Theory and Applications. CRC Press, Ch. 5: Interprocess Communication via Crossbar for Shared Memory Systems-on-Chip.

Fuller, S. H., Millett, L. I., 2011. Computing Performance: Game Over or Next Level? Computer 44, 31–38. URL http://download.nap.edu/cart/download.cgi?&record_id=12980

Hill, M. D., Marty, M. R., 2008. Amdahl's Law in the Multicore Era. IEEE Computer 41 (7), 33–38.

Karp, A. H., Flatt, H. P., May 1990. Measuring parallel processor performance. Commun. ACM 33 (5), 539–543. URL http://doi.acm.org/10.1145/78607.78614

Pingali, K., et al., 2011. The tao of parallelism in algorithms. In: Proceedings of the 32nd ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI '11). URL http://doi.acm.org/10.1145/1993316.1993501
Sheng, W., Schürmans, S., Odendahl, M., Bertsch, M., Volevach, V., Leupers, R., Ascheid, G., 2014. A compiler infrastructure for embedded heterogeneous MPSoCs. Parallel Computing 40, 51–68.

S(o)OS project, 2010. Resource-independent execution support on exa-scale systems.