Performance Comparison for Scientific Computations on the Edge via Relative Performance
Aravind Sankaran
IRTG - Modern Inverse Problems
RWTH Aachen University
Aachen, Germany
[email protected]
Paolo Bientinesi
Department of Computer Science
Umeå Universitet
Umeå, Sweden
[email protected]
Abstract—In a typical Internet-of-Things setting that involves scientific applications, a target computation can be evaluated in many different ways depending on the split of computations among various devices. On the one hand, different implementations (or algorithms), equivalent from a mathematical perspective, might exhibit significant difference in terms of performance. On the other hand, some of the implementations are likely to show similar performance characteristics. In this paper, we focus on analysing the performance of a given set of algorithms by clustering them into performance classes. To this end, we use a measurement-based approach to evaluate and score algorithms based on pair-wise comparisons; we refer to this approach as "relative performance analysis". Each comparison yields one of three outcomes: one algorithm can be "better", "worse", or "equivalent" to another; those algorithms evaluating to have "equivalent" performance are merged into the same performance class. We show that our clustering methodology facilitates algorithm selection with respect to more than one metric; for instance, from the subset of equivalently fast algorithms, one could then select an algorithm that consumes the least energy on a certain device.
Index Terms—performance analysis, algorithm ranking, clustering, scientific computing, distributed computing, edge computing
I. INTRODUCTION
With the rise of heterogeneous computing, the evaluation of a mathematical expression can possibly be split among various devices such as CPU-GPU, CPU-Raspbian, CPU-Smartphone, etc. The optimization of scientific computations involves identifying the part of the code that can be executed on the edge device and the part that can be offloaded to an accelerator (such as a GPU) or a server; based on the split of the target computation among devices, there can be myriads of mathematically equivalent algorithms. The efficient computation of mathematical expressions is critical, especially for latency-sensitive applications; for instance, a faster real-time video analytics algorithm facilitates improved response time for intelligent vehicle applications [1]. However, when such algorithms are run on resource-constrained hardware, it becomes important to ensure that the executions are also energy efficient. Therefore, in an Internet-of-Things (IoT) setting, the code is typically optimized with respect to more than one performance criterion (such as least execution time and energy).
Financial support from the Deutsche Forschungsgemeinschaft (German Research Foundation) through grants GSC 111 and IRTG 2379 is gratefully acknowledged.

In this paper, a given set of mathematically equivalent algorithms A is clustered into performance classes C_i ⊆ A, where i ∈ {1, . . . , k} and k is the number of performance classes; k is determined dynamically and is not specified at the start. All the algorithms in a class C_i exhibit equivalent performance (based on one of the performance criteria), and the algorithms in C_i perform better than those in C_{i+1}. The subsets C_i are specific to a given computing architecture, operating system, and run-time settings. Such a clustering of algorithms into performance classes facilitates the selection of an algorithm based on additional performance criteria; for instance, if the algorithms were clustered based on execution time, then from the subset of fast algorithms, one could choose the algorithm that performs at most X floating point operations on an energy-constrained edge device.

The algorithms in A represent different, alternative ways of splitting a target computation among various devices. In exact arithmetic, those algorithms would all return the same quantity. For instance, consider the "scientific code" in Figure 1a, consisting of two separate loops L1 and L2 (assume that L2 cannot be executed before the completion of L1), each calling a certain function that performs a matrix-matrix multiplication. Let us assume that the code is invoked from an edge device (D) and that we have the option to offload a part of (or the whole) computation to an accelerator (A); if one restricts the granularity of offloading to an entire loop, then there are already four different ways to split the computations among the devices (shown in Figure 1a). That is, either both L1 and L2 can be executed on the edge device ("DD") or both can be offloaded to an accelerator ("AA"); or just one of the loops can be offloaded to an accelerator ("DA" or "AD").

Fig. 1: Splitting parts of a scientific code among devices. (a) Different ways of splitting computations among devices. (b) Distributions of execution times of the equivalent algorithms with computations split between CPU (D) and GPU (A).

In order to score the algorithms in A and cluster them into performance classes, we use a measurement-based approach. Performance measurements, such as execution time, are usually influenced by many factors, and repeated measurements often result in different numbers [2]–[5]. As the variability in the measurements increases (due to system noise), a single number (such as the statistical mean, median or minimum) cannot reliably capture the performance of an algorithm; as a consequence, the result of an algorithm comparison might not be consistent when the performance measurements are repeated.
Therefore, comparing the performance of any two algorithms should imply comparing two sets of measurements (or "distributions"), and the result should yield one of three outcomes: "better", "worse" or "equivalent"; in order for one algorithm to be better or worse than the other, there should be a significant difference in their distributions (for example, distributions "AD" and "DD" in Figure 1b); otherwise, the two algorithms are said to be equivalent (for example, distributions "DD" and "DA" in Figure 1b).

We use the three-way pair-wise comparison to sort the algorithms in A according to the bubble-sort procedure [6]. In order to cluster the algorithms in A into performance classes, those algorithms evaluating to have equivalent performance in the pair-wise comparison are merged into the same class. The distributions of the execution times of the four algorithms (implemented in Tensorflow 2.1 [7]), with computations split between a CPU (Intel(R) Xeon(R) Platinum 8160, 1 core) and a GPU (Nvidia Pascal P100 SXM2), are shown in Figure 1b. One can observe that the algorithm with just L1 offloaded to the GPU ("AD") performs significantly better than the rest; although L2 consists of a larger matrix-matrix multiplication (which in general should perform better on a GPU), for this particular example, the overhead caused by the larger data movement between CPU and GPU is slightly more than the speed-up gain. In this paper, we do not make any assumption about the statistical nature of the distribution of the performance measurements; thus, the approach extends naturally to any Device-Accelerator(s) combination (such as CPU-Raspbian, Smartphone-GPU(s), etc.). The analysis can also be extended to any arbitrary mathematical function beyond matrix-matrix multiplication.

Scientific codes consisting of a sequence of loops, each evaluating a certain mathematical expression, are commonly encountered in many practical applications. We describe a couple of them:

1) Digital-Twin applications involving multi-scale modelling.
Evaluation of mathematical expressions in a loop is often encountered in simulation problems. Solving a hierarchy of such problems (where the results from one simulation are used to solve the next one) with varying computational volumes is known as multi-scale modelling. Applications of multi-scale modelling include addressing challenges in highly contemporary issues such as climate, energy, biology, sociology, finance, etc. [8]. In recent times, many real-time, latency-critical applications that involve multi-scale modelling are being realized because of the emergence of concepts such as the Digital-Twin [9]. A digital representation that infers real-time updates of a physical object or a system, mostly by solving simulation problems, is known as a Digital-Twin. The real-time evaluation of complex simulations has become possible only recently, thanks to the integration of neural networks with multi-scale modelling [10]; the knowledge of physics is used to build such neural networks, and these methods are commonly known as Physics Informed Neural Networks [11]. For many such applications, scientific computations are performed on data that is collected from many devices (sensors, mostly constrained in computational capacity), and the split of computations among the computing resources has to be optimized to minimize the response time. In [9], the authors highlight the need to address the computational and hardware-related challenges that arise in the practical adaptation of such applications.
2) Hierarchical object-detection in images.
A wide range of real-life applications, from autonomous driving to security in work environments, have been realized with real-time object-detection [1]. The on-board processor in a hand-held device (such as a smartphone) or an autonomous drone, although constrained in computational capacity, can still be used to run low-fidelity object detectors (such as YOLO [12]) for the quick identification of objects. Higher-fidelity object detectors (such as SSD [13]) can run simultaneously in the background and can be used to correct the low-fidelity detections (in case of error), but with a lag. This lag can be minimized by properly choosing the parts of the code that could be offloaded to an accelerator.

For our experiments, we consider a scientific code that solves a sequence of mathematical problems implemented in Tensorflow 2.1 [7]. Specifically, we consider a mathematical problem that consists of solving the Regularized Least Squares equation [14] in a loop (a standard statement of the problem is given below). There is more than one such loop, and each can be executed either on a CPU or a GPU; all the other device-accelerator settings, such as CPU-Smartphone, CPU-Raspbian, etc., can be simulated by adding artificial delays and controlling the number of threads. We evaluate the performance of the algorithms by measuring their execution times and clustering them into performance classes. We show that such a clustering can aid in selecting the algorithm that minimizes the operating cost or the number of floating point operations performed (which minimizes energy) on a particular device.
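For reference, the standard ridge-regression form of the Regularized Least Squares problem, which matches the closed-form solve used later in Procedure 6 (with λ playing the role of the "penalty"), is:

\min_{Z} \; \|AZ - B\|_F^2 + \lambda \|Z\|_F^2,
\qquad
Z^{\star} = \left(A^T A + \lambda I\right)^{-1} A^T B .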
Contributions: We use the methodology we developed in [15] to cluster a given set of equivalent algorithms A. In contrast to our analysis in [15], where we aimed to identify only the subset of "fastest" algorithms, we now focus on clusters of algorithms from all performance classes. One of the reasons for looking beyond just the fastest algorithms is that, in an edge-computing environment, one might be interested in switching between algorithms from different performance classes to optimize energy on a device; for instance, when the energy consumption of a particular device reaches a certain threshold, one might switch to an algorithm that performs fewer floating point operations (FLOPs) on that device, and then switch back to the high-performance algorithm after a while. In this work, the execution times of the algorithms are explicitly measured and the resulting clusters are specific to the underlying architecture and run-time settings; if the operating conditions are changed, the measurements have to be repeated. However, these clusters can be used as ground truth to train performance models that can automatically identify the algorithm of required performance without executing the algorithms (the modelling and prediction of relative performance is the objective of our future work, and it is out of the scope of this article). Moreover, such performance models for automatic algorithm selection can obtain better accuracy when trained with a particular loss function, known as Triplet loss [16], where both positive (fast algorithm) and negative (worst algorithm) examples are used to train the model; for such a training, the algorithms clustered into different performance classes would be required.

Organization: In Sec. II, we survey the state of the art. In Sec. III, we describe the methodology to cluster the equivalent algorithms, and we discuss some applications in Sec. IV. Finally, in Sec. V, we draw conclusions and discuss the need for relative performance modelling.

II. RELATED WORKS
In the past decade, we have witnessed an increasing number of latency-sensitive applications, such as intelligent vehicles [1], real-time video analytics [17], augmented reality [18], etc., that require costly scientific computations on resource-constrained hardware. For such applications, offloading all the computations to a cloud is not an option because of the unacceptable communication latency between the cloud and end devices [19], [20]. On the other hand, the increasing capabilities of mobile devices make them more suitable for scientific computing loads [21], [22]. Consequently, there is a trend of computation offloading [23] toward edge computing [20], [24], [25], where significant gains are realized when computations are shifted towards the edge of the network (including the local device) [26]. For instance, a proof-of-concept platform that runs a face recognition application in [27] shows that the response time is reduced from 900 to 169 ms by moving computation from the cloud to the edge. Depending on the distribution of workload among devices, myriads of implementations for the same computational problem can be devised, each having a significant impact on both latency and energy consumption. For instance, CloneCloud [28] performs on-demand partitioning of workload between mobile and cloud, and its prototype could reduce the running time and energy of the tested application by up to 20x.

The increase of such decentralized applications in recent times can be attributed to containerization tools like Docker [29], which ease the portability of code without compromising performance [30], [31] and enable the use of the same Linux environment across heterogeneous devices. Furthermore, recent high-level programming languages and environments such as Julia [32], Tensorflow [7], PyTorch [33], etc., which abstract the calls to linear algebra libraries specially optimized for different hardware (such as a Raspberry Pi or a GPU), allow developers to write code without worrying about the performance that can be achieved on the underlying architecture. Therefore, the same scientific code can achieve high performance on different devices, irrespective of the operating system, and the computations can easily be moved among devices just by modifying a flag value.

The improved ease in developing high-performance scientific code pushes academics into new research directions that lead to the realization of many real-time applications which earlier seemed impractical. For instance, traditional physics-based simulations are slow, and are used only to design, gain insights into system behaviour, or to analyse outputs and derive theories. However, with the increasing integration of physics with fast neural-network computations, commonly referred to by the umbrella term "Physics Informed Neural Networks" (PINNs) [11], significant speed-ups in solving complex simulation problems have been reported of late [34], [35]. In [36], the authors survey and report a wide range of new technological applications with PINNs. In general, PINNs integrate data and physics-based approaches to create a virtual representation of a real system, commonly known as a Digital Twin, which is then synchronized to realize many practical applications involving failure prediction and maintenance response on a factory floor, remote monitoring of autonomous vehicles, etc. [37], [38]. Most of these applications include the use of sensors through which data are collected, and the computations can be distributed among many devices.
In [9], the authors study the feasibility of the practical adaptation of such technologies and mention the need to address the computational and hardware-related challenges that arise in a real-time environment.

As a first step towards addressing these challenges, a framework to quantify and interpret the performance of scientific codes in an edge-computing setting is essential. While previous works such as [39]–[41] optimize performance by finding out where to offload an entire computational task, we focus more on optimally splitting a single computational task among devices. Unlike [28], [42], [43], where the focus of performance optimization is on a specific kind of computational task or operating environment, we attempt to arrive at a methodology that is generic for scientific computations in general. Although our approach is measurement-based (the implementations have to be executed and timed), the results of our methodology will form a basis to derive automatic (execution-less) computation-splitting strategies with reinforcement learning, or serve as a ground truth for performance prediction using supervised learning.

III. METHODOLOGY
Given a set of p equivalent algorithms alg_1, . . . , alg_p ∈ A, we want to form performance clusters C_i ⊆ A, where i ∈ {1, . . . , k} and k (≤ p) is the number of clusters; k is determined dynamically and need not be known at the start. The performance of the algorithms in cluster C_i is expected to be greater than that of those in C_{i+1}. To this end, we first execute and measure the performance (say, execution time) of every algorithm in A multiple times (say, N times). When the fluctuations in the performance measurements are large, it becomes difficult to summarize the performance in a single number. Instead of summarizing the performance statistic (such as the mean or minimum execution time) of all the N measurements into one number, multiple statistics are evaluated and compared on data that is randomly sampled from the N measurements; this approach is commonly known as "bootstrapping". This allows us to gain more information from the set of measurements (also referred to as distributions or histograms) of execution times, which are then used for performance comparisons [15]. The distributions are compared pair-wise, and two algorithms are merged into the same cluster if their comparison evaluates to performance-equivalent. Two algorithms are said to be equivalent if their distributions of measurements significantly overlap. The "bootstrapping" strategy is used to quantify the overlap of the distributions of two algorithms and classify one to be "better", "equivalent" or "worse" than the other; the procedure to implement this comparison strategy is described in Section IV of [15].
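For concreteness, the following is a minimal Python sketch of such a bootstrap-based three-way comparison; the choice of the median as the statistic, the number of resamples, and the threshold eps are illustrative assumptions, not the exact scheme of [15].

import numpy as np

def compare(times_a, times_b, n_boot=1000, eps=0.05, rng=None):
    """Three-way comparison of two sets of execution-time
    measurements; returns the outcome for the first argument:
    'better', 'worse' or 'equivalent'."""
    rng = rng or np.random.default_rng()
    wins = 0
    for _ in range(n_boot):
        # Resample each set with replacement and compare a
        # summary statistic of the resampled data.
        sa = rng.choice(times_a, size=len(times_a), replace=True)
        sb = rng.choice(times_b, size=len(times_b), replace=True)
        if np.median(sa) < np.median(sb):  # smaller time wins
            wins += 1
    frac = wins / n_boot
    if frac > 1.0 - eps:
        return 'better'
    if frac < eps:
        return 'worse'
    return 'equivalent'  # the distributions overlap significantly

On the measurements of Figure 1b, such a comparison would be expected to return 'better' for the pair ("AD", "DD") and 'equivalent' for ("DD", "DA").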
Procedure 1 SortAlgs(A)
Input: alg_1, alg_2, . . . , alg_p ∈ A
Output: ⟨(alg_{s[1]}, rank_1), . . . , (alg_{s[p]}, rank_p)⟩
  for i = 1, . . . , p do
    Initialize rank_i ← i            ▷ initialize algorithm ranks
    Initialize s_i ← i               ▷ initialize algorithm indices
  Initialize S ← ⟨(alg_{s[1]}, rank_1), . . . , (alg_{s[p]}, rank_p)⟩
  for i = 1, . . . , p do
    for j = 0, . . . , p−i−1 do
      Compare alg_{s[j]} and alg_{s[j+1]}
      S ← UpdateAlgIndices(S, s[j])
      S ← UpdateAlgRanks(S, s[j])
  return ⟨(alg_{s[1]}, rank_1), . . . , (alg_{s[p]}, rank_p)⟩
Procedure 2 UpdateAlgIndices(S, i)
Input: ⟨(alg_1, rank_1), . . . , (alg_p, rank_p)⟩ ∈ S; i ∈ {1, . . . , p}
Output: updated sequence set S
  if alg_i < alg_{i+1} then
    swap the positions of alg_i and alg_{i+1}
  return updated sequence set S
Procedure 3 UpdateAlgRanks(S, i)
Input: ⟨(alg_1, rank_1), . . . , (alg_p, rank_p)⟩ ∈ S; i ∈ {1, . . . , p}
Output: updated sequence set S
  if alg_i ∼ alg_{i+1} and rank_i ≠ rank_{i+1} then
    rank_{i+1}, . . . , rank_p decreased by 1
  if alg_i > alg_{i+1} and rank_i ≠ rank_{i+1} then
    if rank_i = rank_{i−1} then
      rank_{i+1}, . . . , rank_p decreased by 1
  if alg_i > alg_{i+1} and rank_i = rank_{i+1} then
    if rank_i ≠ rank_{i−1} then
      rank_{i+1}, . . . , rank_p increased by 1
  return updated sequence set S

We use the three-way comparison function to sort the set of algorithms in A; that is, we do a sort in which the comparison is not just a binary relation (better or worse), but also admits the equivalence of two algorithms. This sorting procedure is summarized in Procedure 1; in this work, we focus only on the update in each iteration of the sort, and the sorting procedure as a whole is not optimized for performance. The outcome of Procedure 1 is a sequence set S consisting of tuples ⟨(alg_{s[1]}, rank_1), . . . , (alg_{s[p]}, rank_p)⟩, where s[j] is the index of the j-th algorithm in the sorted sequence and rank_j ∈ {1, . . . , k} is the cluster assigned to alg_j. In the beginning, the sequence set S is randomly initialized (see Procedure 1). As an illustration, consider the four algorithms alg_DD, alg_AA, alg_DA, alg_AD ∈ A in Figure 1a. The distributions obtained from N = 500 time measurements of each algorithm are shown in Figure 1b. Let the initial sequence set S be ⟨(alg_{s[1]}=DD, 1), (alg_{s[2]}=AA, 2), (alg_{s[3]}=DA, 3), (alg_{s[4]}=AD, 4)⟩. The intermediate steps of the bubble-sort procedure for this illustration are shown in Figure 2. In every iteration of the sort procedure, adjacent pairs of algorithms alg_{s[j]} and alg_{s[j+1]}, starting from the right-most algorithm in S, are compared using the bootstrap strategy from [15], and the following update rules are applied (a consolidated Python sketch of the sort is given after the example below):

1) Update of Algorithm Indices:
An algorithm occurring earlier in the sorted sequence S should perform at least as well as those occurring later in the sequence. If an algorithm alg_{s[j]} performs worse (<) than its successor alg_{s[j+1]}, then the indices (or positions) of both algorithms in the sequence are swapped; otherwise, the positions remain the same. In the first comparison of our example, alg_{s[1]}=DD and alg_{s[2]}=AA are compared and alg_DD performs worse than its successor; hence, the two algorithms swap positions (now alg_{s[1]}=AA and alg_{s[2]}=DD).

2) Update of Algorithm Ranks:

a) If the algorithms have not been swapped:
The algorithms are not swapped if alg_{s[j]} evaluates to be better (>) than or equivalent (∼) to alg_{s[j+1]}. If the comparison is "equivalent", alg_{s[j]} and alg_{s[j+1]} should be assigned the same rank; if the ranks are different, they are merged by decreasing the ranks of alg_{s[j+1]}, . . . , alg_{s[p]} by 1. If the comparison is "better", the ranks are not updated. In the second comparison of our illustration, the performance of alg_DD and alg_DA results to be "equivalent", but the two have different ranks. Hence, the ranks of the algorithms occurring after alg_DD in the sequence are decreased by 1; alg_DD and alg_DA now share rank 2, and the rank of alg_AD is corrected to 3.

b) If the algorithms have been swapped:
The indices of the algorithms are swapped only if alg_{s[j]} performs worse (<) than its successor alg_{s[j+1]}. After swapping the positions of alg_{s[j]} and alg_{s[j+1]}, if alg_{s[j]} has the same rank as its predecessor alg_{s[j−1]} but a different rank from its successor alg_{s[j+1]}, then the ranks of alg_{s[j+1]}, . . . , alg_{s[p]} are decreased by 1. However, if alg_{s[j]} shares the same rank with its successor alg_{s[j+1]} but has a different rank from its predecessor alg_{s[j−1]}, then the ranks of alg_{s[j+1]}, . . . , alg_{s[p]} are increased by 1. For instance, in the third comparison in Figure 2, alg_{s[3]}=DA performs worse than alg_{s[4]}=AD and they swap positions (now alg_{s[3]}=AD and alg_{s[4]}=DA); alg_{s[3]}=AD has the same rank as its predecessor alg_{s[2]}=DD (rank 2) but a different rank from its successor alg_{s[4]}=DA (rank 3); hence the rank of alg_{s[4]}=DA is decreased by 1, so that all three algorithms alg_DD, alg_AD, alg_DA now share the same rank. In step 4 of our illustration, the algorithms with the same rank, alg_{s[2]}=DD and alg_{s[3]}=AD, swap positions, but after the swap alg_{s[2]}=AD has a different rank from its predecessor. Since alg_{s[2]}=AD reached the top of its performance class (having defeated all the other algorithms with the same rank), alg_{s[2]}=AD should now be assigned a better rank than the other algorithms in its class. To this end, alg_{s[2]}=AD stays at rank 2, but its successors alg_{s[3]}=DD and alg_{s[4]}=DA are pushed to rank 3.

Fig. 2: Bubble Sort with three-way comparison

Following the steps in Procedure 1, the final sequence set for our illustration results as ⟨(alg_{s[1]}=AD, 1), (alg_{s[2]}=AA, 2), (alg_{s[3]}=DD, 3), (alg_{s[4]}=DA, 3)⟩, where the algorithms are clustered into three performance classes.

Computing the relative scores: The clustering procedure discussed above is not deterministic, especially when the fluctuations in the performance measurements are large and the distributions are partially overlapping (for example, distributions "AD" and "AA" in Figure 1b). Such overlaps become more evident when the number of measurements N is small. For N = 30, alg_AD is just at the threshold of being better than alg_AA; hence, in the last step of our sorting procedure, the result of comparing these two algorithms would switch between alg_AA < alg_AD and alg_AA ∼ alg_AD, and this can change the final ranks of the algorithms. As alg_AA results to be equivalent to alg_AD once in every three comparisons, approximately thirty percent of the time alg_AA is assigned to the rank-1 cluster (and to the rank-2 cluster the rest of the time). Therefore, we compute relative scores for the algorithms separately for each cluster. To this end, Procedure 1 is repeated Rep times, shuffling the set A before each clustering (we do not repeat the execution of the algorithms, but repeat the procedure over the same set of N measurements); if an algorithm alg_j is assigned to cluster r (or, with rank r) in at least one out of the Rep iterations, alg_j receives a "relative score" of w_j/Rep with respect to cluster r, where w_j (≤ Rep) is the number of times alg_j obtains rank r (see Procedure 4). The relative score of an algorithm alg_j with respect to cluster r represents the confidence in alg_j being assigned to cluster r. For our illustration, we obtain the following relative score estimates for the algorithms assigned to each cluster:

• Rank 1 cluster C_1: {(alg_AD, 1.0), (alg_AA, 0.3)}
• Rank 2 cluster C_2: {(alg_AA, 0.7), (alg_DD, 0.3), (alg_DA, 0.3)}
• Rank 3 cluster C_3: {(alg_DD, 0.7), (alg_DA, 0.6)}
• Rank 4 cluster C_4: {(alg_DA, 0.1)}

The resulting clusters could be used as a ground truth to train performance models. If the assignment of algorithms to more than one cluster is not preferred for training performance models, then one could simply assign each algorithm to the cluster for which it obtained the maximum relative score, and compute the final relative score by summing the scores from better ranks; for instance, alg_DA was assigned rank 3 in 60 percent of the iterations and rank 2 in 30 percent of the iterations; hence alg_DA would be assigned rank 3, for which it obtained the maximum relative score of 0.6, and its relative score from rank 2 would be cumulated, thereby resulting in a final relative score of 0.9. Then, the final clustering would be C_1: {(alg_AD, 1.0)}; C_2: {(alg_AA, 0.7)}; C_3: {(alg_DD, 1.0), (alg_DA, 0.9)}.
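The consolidated Python sketch referenced above follows. For brevity, instead of updating ranks inside every swap (Procedures 2 and 3), this sketch sorts with the three-way comparator and assigns ranks in a final pass that merges adjacent equivalent algorithms; with a consistent comparator this yields the same clustering, although it is not a line-by-line transcription of the procedures.

def sort_algs(algs, measurements, compare):
    """Bubble sort with a three-way comparator. `algs` is a list
    of algorithm names; `measurements` maps a name to its array
    of execution times. Returns a list of (algorithm, rank)."""
    s = list(algs)
    p = len(s)
    for i in range(p):
        for j in range(p - i - 1):
            # Swap whenever the left algorithm is worse (slower).
            if compare(measurements[s[j]], measurements[s[j + 1]]) == 'worse':
                s[j], s[j + 1] = s[j + 1], s[j]
    # Adjacent equivalent algorithms are merged into one class.
    ranked = [(s[0], 1)]
    for prev, cur in zip(s, s[1:]):
        same = compare(measurements[prev], measurements[cur]) == 'equivalent'
        ranked.append((cur, ranked[-1][1] + (0 if same else 1)))
    return ranked

For the example of Figure 1b, sort_algs(['DD', 'AA', 'DA', 'AD'], times, compare) would typically return [('AD', 1), ('AA', 2), ('DD', 3), ('DA', 3)].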
IV. EXPERIMENTS

The main motivation behind the clustering of equivalent algorithms into performance classes is to aid in algorithm selection. The strategy to select the best algorithm, especially in an edge-computing environment, is not always based on just one performance metric (such as selecting the algorithm with the minimum execution time). For instance, if a computationally intensive algorithm is running on a resource-constrained edge device such as a smartphone or a tablet, in addition to aiming for a fast execution of the code, it also becomes essential to monitor the resource usage from time to time so that the device does not consume more energy than the prescribed budget. One way to control the resource utilization on a device is by restricting the number of floating point operations (FLOPs) performed by the scientific code on that device. To this end, one can choose an alternate equivalent algorithm that restricts the FLOPs on that particular device by shipping parts of the computation to an accelerator. Now, the problem of algorithm selection arises when one has to decide which parts of the computation should be exported to an accelerator so that the compromise on the overall execution time of the code is minimal. Recall that, based on the split of computations, there can be many alternate algorithms with different performance characteristics. In this section, we demonstrate potential usages of our clustering methodology in facilitating algorithm selection.

For our experiment, we consider a scientific code that calls a sequence of three "MathTasks" (L1, L2, L3 in Procedure 5). The computational volume (number of FLOPs) of every MathTask is different, and every task computes a "penalty" that is used to solve the next task; hence, the three tasks cannot be executed concurrently. This kind of set-up is common in multi-scale modelling [8]. For our analysis, we consider a MathTask that solves the Regularized Least Squares (RLS) equation in a loop (see Procedure 6); the sizes of the matrices in the RLS equation determine the computational volume of the MathTask: a larger size requires more FLOPs. Each MathTask can be computed either on a single core of an Intel(R) Xeon(R) Platinum 8160 CPU (edge device: D) or offloaded to an Nvidia Pascal P100 SXM2 GPU (accelerator: A). Although the GPU can perform more FLOPs per second, it incurs an extra cost due to data movement; therefore, offloading the entire code to the accelerator does not necessarily reduce the overall execution time. Depending on which part of the code (L1, L2 or L3) is offloaded to the accelerator, there are in total 8 possible equivalent algorithms (see Table I). The execution time of every algorithm is measured 30 times and, by applying the methodology described in Section III, the distributions of measurements have been clustered into five performance classes. Table I shows the algorithms in each cluster. The code (implemented in Tensorflow 2.1 [7]) is available at https://github.com/HPAC/Relative-Performance.

Procedure 4 GetCluster_r(A, Rep, R)
Input: alg_1, alg_2, . . . , alg_p ∈ A; Rep, R ∈ Z+
Output: (a_1, w_1), (a_2, w_2), . . . , (a_q, w_q), with a_i ∈ C_r and w_i ∈ R
  a ← [ ]                          ▷ initialize empty lists
  a_tmp ← [ ]
  for i = 1, . . . , Rep do
    Shuffle(A)
    SortAlgs(A)
    ã_tmp ← select algorithms with rank R
    append ã_tmp to the list a_tmp
  a ← select unique algorithms in a_tmp       ▷ |a| ≤ |a_tmp|
  q = |a|
  for i = 1, . . . , q do
    w_i ← number of occurrences of a_i in a_tmp
    w_i ← w_i / Rep
  return (a_1, w_1), (a_2, w_2), . . . , (a_q, w_q)
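A compact Python rendering of Procedure 4 follows; it assumes the sort_algs and compare sketches given in Section III and uses random.shuffle for the shuffling of A.

import random
from collections import Counter

def get_cluster(algs, measurements, compare, rep, r):
    """Relative scores for the rank-r cluster: repeat the sort
    `rep` times, shuffling A before each clustering, and count
    how often each algorithm lands at rank r."""
    counts = Counter()
    for _ in range(rep):
        random.shuffle(algs)
        for alg, rank in sort_algs(algs, measurements, compare):
            if rank == r:
                counts[alg] += 1
    # w_i / Rep, as in Procedure 4
    return {alg: w / rep for alg, w in counts.items()}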
Procedure 5 Scientific Code(D1, D2, D3)
Input: D1, D2, D3 ∈ {device (D), accelerator (A)}
  Initialize penalty
  Run on D1: penalty ← MathTask(50, penalty)       ▷ L1
  Run on D2: penalty ← MathTask(75, penalty)       ▷ L2
  Run on D3: penalty ← MathTask(300, penalty)      ▷ L3

Procedure 6 MathTask(size, penalty)
Input: size, penalty ∈ R
  for i = 1, . . . , n do
    Randomly generate A                 ▷ A ∈ R^{size×size}
    Randomly generate B                 ▷ B ∈ R^{size×size}
    Z ← (A^T A + penalty · I)^{−1} A^T B
    penalty ← ||AZ − B||
  return penalty
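To illustrate how a particular split is expressed in code, the following is a minimal Tensorflow sketch of MathTask with explicit device placement; the device strings, the loop size, and the use of tf.linalg.solve instead of an explicit matrix inverse are our assumptions, not necessarily the choices made in the repository linked above.

import tensorflow as tf

def math_task(size, penalty, device, n=10):
    """One MathTask: n iterations of the RLS solve
    Z = (A^T A + penalty*I)^(-1) (A^T B), pinned to `device`."""
    with tf.device(device):
        for _ in range(n):
            A = tf.random.normal((size, size))
            B = tf.random.normal((size, size))
            lhs = tf.matmul(A, A, transpose_a=True) + penalty * tf.eye(size)
            Z = tf.linalg.solve(lhs, tf.matmul(A, B, transpose_a=True))
            penalty = tf.norm(tf.matmul(A, Z) - B)
    return penalty

# alg_DDA: L1 and L2 on the edge device (CPU), L3 on the accelerator (GPU);
# the initial penalty value of 0.0 is an assumption for illustration.
penalty = math_task(50, 0.0, '/CPU:0')
penalty = math_task(75, penalty, '/CPU:0')
penalty = math_task(300, penalty, '/GPU:0')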
TABLE I: Clustering of algorithms

Cluster | Algorithms
C_1     | alg_DDA, alg_DAA
C_2     | alg_DDD, alg_DAA
C_3     | alg_ADA, alg_ADD, alg_DAD
C_4     | alg_AAA, alg_DAD
C_5     | alg_AAD

Let us now assume that there is an operating cost involved in executing the code on the accelerator (A). In order to minimize such a cost, the best option would be to execute the whole code on the edge device (D). But executing the entire code on the edge device (alg_DDD) is not the optimal choice in terms of performance; yet alg_DDD is not so bad, as it falls in the second-best performance class C_2. Procuring an accelerator and offloading L3 to it would improve performance, as algorithm alg_DDA falls in the best performance class C_1. Whether one should spend money on an accelerator or not depends on the margin of speed-up and on what the application as a whole can gain through that speed-up. Therefore, the choice of algorithm is now based on a "decision-model" that trades off operating cost against speed. In our example, for a small loop size of n = 10 in Procedure 6, the mean execution time of alg_DDA is just 0.002s less than that of alg_DDD, a speed-up of approximately 1.05. When n becomes larger, the speed-up increases. Thus, a decision-model can make a trade-off between n, the relative scores and the operating cost to aid in algorithm selection; the weight on the operating cost would depend on the importance of speed-up for the application. For latency-critical applications, such as those involving the response of an autonomous vehicle to external conditions, even a small improvement of 0.002s in the execution time can make a significant difference in the quality of the application; for instance, the safety of autonomous vehicles can improve if an object detector can evaluate more objects per second.

Let us consider another application where it is ideal to run the whole code on the edge device (alg_DDD); however, the device cannot persistently handle all the computations because of energy constraints. Therefore, at regular intervals, the amount of computation on the edge has to be reduced for a small period of time. In such a case, one can switch to alg_DAA among the algorithms in C_1, as it offloads most of the computations to the accelerator, and then switch back to alg_DDD when the device cools down; a toy sketch of such a switching policy follows.
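In the sketch below, the energy-monitoring inputs are hypothetical placeholders, since the paper does not prescribe a specific monitoring API.

def select_algorithm(energy_used, energy_budget):
    """Energy-aware switching between performance classes: run
    alg_DDD by default, and fall back to alg_DAA (which offloads
    most of the FLOPs to the accelerator) while the edge device
    is over its energy budget."""
    # `energy_used` and `energy_budget` are assumed to be
    # provided by a device-specific monitoring facility.
    return 'DAA' if energy_used > energy_budget else 'DDD'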
V. CONCLUSION AND FUTURE OUTLOOK
Efficient evaluation of scientific computations on the edge is essential for latency-sensitive applications such as intelligent vehicles [1], augmented reality [18], etc. For a given scientific computation, we considered all the equivalent solution algorithms based on the split of the computation among devices; these solution algorithms are mathematically equivalent to one another, but can exhibit significant differences in performance. We presented a measurement-based approach to cluster these algorithms into performance classes through pair-wise comparison of performance measurements. We showed that such a clustering can aid in algorithm selection based on additional criteria such as operating cost and energy.

Although we considered a use-case where it is possible to execute and measure all the different combinations of solution algorithms, this may not be feasible for applications where there is an exponential number of alternative implementations. For instance, the linear algebra expression in line 4 of Procedure 6 can alone have many different equivalent algorithms, each having a different sequence of calls to optimized libraries such as BLAS and LAPACK [44]; typically these algorithms also show significant differences in performance, even without considering the split of computation among devices [15], [45]. Therefore, in case of an exponential explosion of the search space, our methodology can still be applied on a subset of the possible solutions, and the resulting clusters with relative scores can be used as a ground truth to guide the search for algorithms via reinforcement learning. Thus, our methodology can be used to develop performance models that predict relative scores without having to execute all the algorithms.
REFERENCES
[1] D. Grewe, M. Wagner, M. Arumaithurai, I. Psaras, and D. Kutscher, "Information-centric mobile edge computing for connected vehicle environments: Challenges and research directions," in Proceedings of the Workshop on Mobile Edge Communications, 2017, pp. 7–12.
[2] E. Peise and P. Bientinesi, "A study on the influence of caching: Sequences of dense linear algebra kernels," in International Conference on High Performance Computing for Computational Science. Springer, 2014, pp. 245–258.
[3] T. Hoefler, T. Schneider, and A. Lumsdaine, "Characterizing the influence of system noise on large-scale applications by simulation," in SC'10: Proceedings of the 2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis. IEEE, 2010, pp. 1–11.
[4] E. Peise and P. Bientinesi, "Performance modeling for dense linear algebra," in 2012 SC Companion: High Performance Computing, Networking Storage and Analysis. IEEE, 2012, pp. 406–416.
[5] R. Iakymchuk and P. Bientinesi, "Modeling performance through memory-stalls," ACM SIGMETRICS Perform. Eval. Rev., vol. 40, no. 2, pp. 86–91, Oct. 2012. [Online]. Available: https://doi.org/10.1145/2381056.2381076
[6] O. Astrachan, "Bubble sort: an archaeological algorithmic analysis," ACM SIGCSE Bulletin, vol. 35, no. 1, pp. 1–5, 2003.
[7] M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, M. Devin, S. Ghemawat, G. Irving, M. Isard et al., "Tensorflow: A system for large-scale machine learning," in 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16), 2016.
[43] J. Ren, Y. Guo, D. Zhang, Q. Liu, and Y. Zhang, "Distributed and efficient object detection in edge computing: Challenges and solutions," IEEE Network, vol. 32, no. 6, pp. 137–143, 2018.
[44] C. Psarras, H. Barthels, and P. Bientinesi, "The linear algebra mapping problem," arXiv preprint arXiv:1911.09421, 2019.
[45] H. Barthels, C. Psarras, and P. Bientinesi, "Linnea: Automatic generation of efficient linear algebra programs," arXiv preprint arXiv:1912.12924, 2019.