Revisiting Conventional Task Schedulers to Exploit Asymmetry in ARM big.LITTLE Architectures for Dense Linear Algebra
Luis Costero (a), Francisco D. Igual (a,*), Katzalin Olcoz (a), Enrique S. Quintana-Ortí (b)

(a) Departamento de Arquitectura de Computadores y Automática, Universidad Complutense de Madrid, Madrid, Spain
(b) Departamento de Ingeniería y Ciencia de Computadores, Universidad Jaume I, Castellón, Spain
(*) Corresponding author
Abstract
Dealing with asymmetry in the architecture opens a plethora of questions from the perspective of scheduling task-parallel applications, and there exist early attempts to address this problem via ad-hoc strategies embedded into a runtime framework. In this paper we take a different path, which consists in addressing the complexity of the problem at the library level, via a few asymmetry-aware fundamental kernels, hiding the architecture heterogeneity from the task scheduler. For the specific domain of dense linear algebra, we show that this is not only possible but delivers much higher performance than a naive approach based on an asymmetry-oblivious scheduler. Furthermore, this solution also outperforms an ad-hoc asymmetry-aware scheduler furnished with sophisticated scheduling techniques.
Keywords:
Dense linear algebra, Task parallelism, Runtime task schedulers, Asymmetric architectures
1. Introduction
The end of Dennard scaling has promoted heterogeneous systems into a mainstream approach to leverage the steady growth of transistors on chip dictated by Moore's law [1, 2]. ARM big.LITTLE processors are a particular class of heterogeneous architectures that combine two types of multicore clusters, consisting of a few high performance (big) cores and a collection of low power (LITTLE) cores. These asymmetric multicore processors (AMPs) can in theory deliver much higher performance for the same power budget. Furthermore, compared with multicore servers equipped with graphics processing units (GPUs), NVIDIA's Tegra chips and AMD's APUs, ARM big.LITTLE processors differ in that the cores in these systems-on-chip (SoC) share the same instruction set architecture and a strongly coupled memory subsystem.
Email addresses: [email protected] (Luis Costero), [email protected] (Francisco D. Igual), [email protected] (Katzalin Olcoz), [email protected] (Enrique S. Quintana-Ortí)
Task parallelism has been reported as an efficient means to tackle the considerable number of cores in current processors. Several efforts, pioneered by Cilk [3], aim to ease the development and improve the performance of task-parallel programs by embedding task scheduling inside a runtime (framework). The benefits of this approach for complex dense linear algebra (DLA) operations have been demonstrated, among others, by OmpSs [4], StarPU [5], PLASMA/MAGMA [6, 7], Kaapi [8], and libflame [9]. In general, the runtimes underlying these tools decompose DLA routines into a collection of numerical kernels (or tasks), and then take into account the dependencies between the tasks in order to correctly issue their execution to the system cores. The tasks are therefore the "indivisible" scheduling unit while the cores constitute the basic computational resources.

In this paper we introduce an efficient approach to execute task-parallel DLA routines on AMPs via conventional asymmetry-oblivious schedulers. Our conceptual solution aggregates the cores of the AMP into a number of symmetric virtual cores (VCs), which become the only basic computational resources that are visible to the runtime scheduler. In addition, a specialized implementation of each type of task, from an asymmetry-aware DLA library, partitions each numerical kernel into a series of finer-grain computations, which are efficiently executed by the asymmetric aggregation of cores of a single VC. Our work thus makes the following specific contributions:

• We target in our experiments the Cholesky factorization [10], a complex operation for the solution of symmetric positive definite linear systems that is representative of many other DLA computations. Therefore, we are confident that our approach and results carry over to a significant fraction of LAPACK (Linear Algebra PACKage) [11].

• For this particular factorization, we describe how to leverage the asymmetry-oblivious task-parallel runtime scheduler in OmpSs, in combination with a data-parallel instance of the BLAS-3 (basic linear algebra subprograms) [12] in the BLIS library specifically designed for ARM big.LITTLE AMPs [13, 14].

• We provide practical evidence that, compared with an ad-hoc asymmetry-conscious scheduler recently developed for OmpSs [15], our solution yields higher performance for the multi-threaded execution of the Cholesky factorization on an Exynos 5422 SoC comprising two quad-core clusters, with ARM Cortex-A15 and Cortex-A7 technology.

• In conclusion, compared with previous work [13, 15], this paper demonstrates that, for the particular domain of DLA, it is possible to hide the difficulties intrinsic to dealing with an asymmetric architecture (e.g., workload balancing for performance, energy-aware mapping of tasks to cores, and criticality-aware scheduling) inside an asymmetry-aware implementation of the BLAS-3. As a consequence, our solution can refactor any conventional (asymmetry-agnostic) scheduler to exploit the task parallelism present in complex DLA operations.

The rest of the paper is structured as follows. Section 2 presents the target ARM big.LITTLE AMP, together with the main execution paradigms it exposes. Section 3 reviews the state-of-the-art in runtime-based task scheduling and DLA libraries for (heterogeneous and) asymmetric architectures. Section 4 introduces the approach to reuse conventional runtime task schedulers on AMPs by relying on an asymmetric DLA library. Section 5 reports the performance results attained by the proposed approach; and Section 6 closes the paper with the final remarks.
2. Software Execution Models for ARM big.LITTLE SoCs
The target architecture for our design and evaluation is an ODROID-XU3 board comprising a Samsung Exynos 5422 SoC with an ARM Cortex-A15 quad-core processing cluster (running at 1.3 GHz) and a Cortex-A7 quad-core processing cluster (also operating at 1.3 GHz). Both clusters access a shared DDR3 RAM (2 Gbytes) via 128-bit coherent bus interfaces. Each ARM core (either Cortex-A15 or Cortex-A7) has a 32+32-Kbyte L1 (instruction+data) cache. The four ARM Cortex-A15 cores share a 2-Mbyte L2 cache, while the four ARM Cortex-A7 cores share a smaller 512-Kbyte L2 cache.

Modern big.LITTLE SoCs, such as the Exynos 5422, offer a number of software execution models with support from the operating system (OS):

1. Cluster Switching Mode (CSM): The processor is logically divided into two clusters, one containing the big cores and the other with the LITTLE cores, but only one cluster is usable at any given time. The OS transparently activates/deactivates the clusters depending on the workload in order to balance performance and energy efficiency.

2. CPU migration (CPUM): The physical cores are grouped into pairs, each consisting of a fast core and a slow core, building VCs to which the OS maps threads. At any given moment, only one physical core is active per VC, depending on the requirements of the workload. In big.LITTLE specifications where the numbers of fast and slow cores do not match, the VC can be assembled from a different number of cores of each type. The In-Kernel Switcher (IKS) is Linaro's solution for this model.

3. Global Task Scheduling (GTS): This is the most flexible model. All 8 cores are available for thread scheduling, and the OS maps the threads to any of them depending on the specific nature of the workload and core availability. ARM's implementation of GTS is referred to as big.LITTLE MP.

Figure 1 offers a schematic view of these three execution models for modern big.LITTLE architectures. GTS is the most flexible solution, allowing the OS scheduler to map threads to any available core or group of cores. GTS exposes the complete pool of 8 (fast and slow) cores in the Exynos 5422 SoC to the OS. This allows a straightforward port of existing threaded applications, including runtime task schedulers, to exploit all the computational resources in this AMP, provided the multi-threading technology underlying the software is based on conventional tools such as, e.g., POSIX threads or OpenMP. Attaining high performance on asymmetric architectures, even with a GTS configuration, is not as trivial, and is one of the goals of this paper.

Alternatively, CPUM proposes a pseudo-symmetric view of the Exynos 5422, transforming this 8-core asymmetric SoC into 4 symmetric multicore processors (SMPs), which are logically exposed to the OS scheduler. (In fact, as this model only allows one active core per VC, but the type of the specific core that is in operation can differ from one VC to another, the CPUM is still asymmetric.)
Figure 1: Operation modes for modern big.LITTLE architectures: (a) CSM, where the OS/task scheduler sees two asymmetric clusters with only one active at a time; (b) CPUM, with 4 VCs and one physical core per VC active at a time; (c) GTS, with 8 asymmetric cores, any of them active at any time.
In practice, runtime task schedulers can mimic or approximate any of these OS operation modes. A straightforward model is simply obtained by following the principles governing GTS to map ready tasks to any available core. With this solution, load unbalance can be tackled via ad-hoc (i.e., asymmetry-aware) scheduling policies embedded into the runtime that map tasks to the most "appropriate" resource.
3. Parallel Execution of DLA Operations on Multi-threaded Architectures
In this section we briefly review several software efforts, in the form of task-parallel runtimes and libraries, that were specifically designed for DLA, or have been successfully applied in this domain, when the target is (a heterogeneous system or) an AMP.

We start by describing how to extract task parallelism during the execution of a DLA operation, using the Cholesky factorization as a workhorse example. This particular operation, which is representative of several other factorizations for the solution of linear systems, decomposes an n × n symmetric positive definite matrix A into the product A = U^T U, where the n × n Cholesky factor U is upper triangular [10].

void cholesky(int s, int b, double *A[s][s])
{
  // A is an s x s grid of pointers to b x b blocks
  for (int k = 0; k < s; k++) {
    po_cholesky(A[k][k], b, b);                        // Cholesky factorization (diagonal block)
    for (int j = k + 1; j < s; j++)
      tr_solve(A[k][k], A[k][j], b, b);                // Triangular solve
    for (int i = k + 1; i < s; i++) {
      for (int j = i + 1; j < s; j++)
        ge_multiply(A[k][i], A[k][j], A[i][j], b, b);  // Matrix multiplication
      sy_update(A[k][i], A[i][i], b, b);               // Symmetric rank-b update
    }
  }
}

Listing 1: C implementation of the blocked Cholesky factorization.
Listing 1 displays a simplified C code for the factorization of an n × n matrix A, stored as s × s (data) sub-matrices of dimension b × b each. This blocked routine decomposes the operation into a collection of building kernels: po_cholesky (Cholesky factorization), tr_solve (triangular solve), ge_multiply (matrix multiplication), and sy_update (symmetric rank-b update). The order in which these kernels are invoked during the execution of the routine, and the sub-matrices that each kernel reads/writes, result in a directed acyclic graph (DAG) that reflects the dependencies between tasks (i.e., instances of the kernels) and, therefore, the task parallelism of the operation. For example, Figure 2 shows the DAG with the tasks (nodes) and data dependencies (arcs) intrinsic to the execution of Listing 1, when applied to a matrix composed of 4 × 4 blocks (s = 4).

The DAG associated with an algorithm/routine is a graphical representation of the task parallelism of the corresponding operation, and a runtime system can exploit this information to determine the task schedules that satisfy the DAG dependencies. For this purpose, in OmpSs the programmer employs OpenMP-like directives (pragmas) to annotate routines appearing in the code as tasks, indicating the directionality of their operands (input, output or input/output) by means of clauses. The OmpSs runtime then decomposes the code (transformed by the Mercurium source-to-source compiler) into a number of tasks at run time, dynamically identifying dependencies among these, and issuing ready tasks (those with all dependencies satisfied) for their execution to the processor cores of the system.

Listing 2 shows the annotations a programmer needs to add in order to exploit task parallelism using OmpSs; see in particular the lines labelled with the task directives. The clauses in, out and inout denote data directionality, and help the task scheduler to keep track of data dependencies between tasks during the execution. We note that, in this implementation, the four kernels simply boil down to calls to four fundamental computational kernels for DLA from LAPACK (dpotrf) and the BLAS-3 (dtrsm, dgemm and dsyrk).

Figure 2: DAG with the tasks and data dependencies resulting from the application of the code in Listing 1 to a matrix composed of 4 × 4 blocks (s = 4). The labels specify the type of kernel/task with the following correspondence: "C" for the Cholesky factorization, "T" for the triangular solve, "G" for the matrix multiplication, and "S" for the symmetric rank-b update. The subindices (starting at 0) specify the sub-matrix that the corresponding task updates, and the color is used to distinguish between different values of the iteration index k.

For servers equipped with one or more graphics accelerators, (specialized versions of) the schedulers underlying OmpSs, StarPU, MAGMA, Kaapi and libflame distinguish between the execution target being either a general-purpose core (CPU) or a GPU, assigning tasks to each type of resource depending on their properties, and applying techniques such as data caching or locality-aware task mapping; see, among many others, [16, 17, 18, 19, 20].

The designers/developers of the OmpSs programming model and the Nanos++ runtime task scheduler recently introduced a new version of their framework, hereafter referred to as Botlev-OmpSs, specifically tailored for AMPs [15].
This asymmetry-conscious runtime embeds a scheduling policy, CATS (Criticality-Aware Task Scheduler), that relies on bottom-level longest-path priorities, keeps track of the criticality of the individual tasks, and leverages this information, at execution time, to assign a ready task to either a critical or a non-critical queue. In this solution, tasks enqueued in the critical queue can only be executed by the fast cores. In addition, the enhanced scheduler integrates uni- or bi-directional work stealing between fast and slow cores. According to the authors, this sophisticated ad-hoc scheduling strategy for heterogeneous/asymmetric processors attains remarkable performance improvements in a number of target applications; see [15] for further details.

When applied to a task-parallel DLA routine, the asymmetry-aware scheduler in Botlev-OmpSs maps each task to a single (big or LITTLE) core, and simply invokes a sequential DLA library to conduct the actual work. On the other hand, we note that this approach required an important redesign of the underlying scheduling policy (and thus, a considerable programming effort for the runtime developer) in order to exploit the heterogeneous architecture. In particular, detecting the criticality of a task at execution time is a nontrivial question.

void po_cholesky(double *A, int b, int ld)
{
  static int INFO = 0;
  static const char UP = 'U';
  dpotrf(&UP, &b, A, &ld, &INFO);                                      // LAPACK Cholesky factorization
}

void tr_solve(double *A, double *B, int b, int ld)
{
  static double DONE = 1.0;
  static const char LE = 'L', UP = 'U', TR = 'T', NU = 'N';
  dtrsm(&LE, &UP, &TR, &NU, &b, &b, &DONE, A, &ld, B, &ld);            // BLAS-3 triangular solve
}

void ge_multiply(double *A, double *B, double *C, int b, int ld)
{
  static double DONE = 1.0, DMONE = -1.0;
  static const char TR = 'T', NT = 'N';
  dgemm(&TR, &NT, &b, &b, &b, &DMONE, A, &ld, B, &ld, &DONE, C, &ld);  // BLAS-3 matrix multiplication
}

void sy_update(double *A, double *C, int b, int ld)
{
  static double DONE = 1.0, DMONE = -1.0;
  static const char UP = 'U', TR = 'T';
  dsyrk(&UP, &TR, &b, &b, &DMONE, A, &ld, &DONE, C, &ld);              // BLAS-3 symmetric rank-b update
}

Listing 2: Labeled tasks involved in the blocked Cholesky factorization.
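For illustration only, the sketch below shows how the directionality clauses described in the text could be attached to the kernels of Listing 2 with OmpSs-style directives; the array-shaping syntax and the exact clause spelling follow common OmpSs examples and are an assumption, not the authors' verbatim annotations.

/* Hypothetical OmpSs task annotations for the Cholesky kernels (illustrative only). */
#pragma omp task inout([b][b]A)                       // diagonal block is read and overwritten
void po_cholesky(double *A, int b, int ld);

#pragma omp task in([b][b]A) inout([b][b]B)           // reads the diagonal block, updates a panel block
void tr_solve(double *A, double *B, int b, int ld);

#pragma omp task in([b][b]A, [b][b]B) inout([b][b]C)  // C := C - A^T * B
void ge_multiply(double *A, double *B, double *C, int b, int ld);

#pragma omp task in([b][b]A) inout([b][b]C)           // C := C - A^T * A
void sy_update(double *A, double *C, int b, int ld);

With annotations of this kind, the runtime can derive the DAG of Figure 2 automatically from the order in which the kernels are invoked in Listing 1.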
An alternative to the runtime-based (i.e., task-parallel) approach to execute DLA operations on multi-threaded architectures consists in relying on a library of specialized kernels that statically partitions the work among the computational resources, or leverages a simple scheduling mechanism such as those available, e.g., in OpenMP. For DLA operations with few and/or simple data dependencies, as is the case of the BLAS-3, and/or when the number of cores in the target architecture is small, this option can avoid the costly overhead of a sophisticated task scheduler, providing a more efficient solution. Currently this is the preferred option for all high performance implementations of the BLAS for multicore processors, being adopted in both commercial and open source packages such as, e.g., AMD ACML, IBM ESSL, Intel MKL, GotoBLAS [21], OpenBLAS [22] and BLIS.

BLIS in particular mimics GotoBLAS to orchestrate all BLAS-3 kernels (including the matrix multiplication, gemm) as three nested loops around two packing routines, which accommodate the data in the higher levels of the cache hierarchy, and a macro-kernel in charge of performing the actual computations. Internally, BLIS implements the macro-kernel as two additional loops around a micro-kernel that, in turn, boils down to a loop around a rank-1 update. For the purpose of the following discussion, we will only consider the three outermost loops in the BLIS implementation of gemm for the multiplication C := C + A · B, where A, B, C are respectively m × k, k × n and m × n matrices, stored in arrays A, B and C; see Listing 3. In the code, mc, nc, kc are cache configuration parameters that need to be adjusted for performance taking into account, among others, the latency of the floating-point units, the number of vector registers, and the size/associativity degree of the cache levels.

void gemm(int m, int n, int k,
          double A[m][k], double B[k][n], double C[m][n],
          int mc, int nc, int kc)
{
  double *Ac = malloc(mc * kc * sizeof(double)),
         *Bc = malloc(kc * nc * sizeof(double));
  for (int jc = 0; jc < n; jc += nc) {                              // Loop 1
    int jb = min(n - jc, nc);
    for (int pc = 0; pc < k; pc += kc) {                            // Loop 2
      int pb = min(k - pc, kc);
      pack_buffB(&B[pc][jc], Bc, pb, jb);                           // Pack B -> Bc
      for (int ic = 0; ic < m; ic += mc) {                          // Loop 3
        int ib = min(m - ic, mc);
        pack_buffA(&A[ic][pc], Ac, ib, pb);                         // Pack A -> Ac
        gemm_kernel(Ac, Bc, &C[ic][jc], ib, jb, pb, mc, nc, kc);    // Macro-kernel
      }
    }
  }
}

Listing 3: High performance implementation of gemm in BLIS.
The implementation of gemm in BLIS has been demonstrated to deliver high performance on a wide range of multicore and many-core SMPs [23, 24]. These studies offered a few relevant insights that guided the parallelization of gemm (and also other Level-3 BLAS) on the ARM big.LITTLE architecture under the GTS software execution model. Concretely, the architecture-aware multi-threaded parallelization of gemm in [13] integrates the following three techniques:

• A dynamic 1-D partitioning of the iteration space to distribute the workload of either Loop 1 or Loop 3 of the BLIS gemm between the two clusters.

• A static 1-D partitioning of the iteration space that distributes the workload of one of the loops internal to the macro-kernel between the cores of the same cluster.
• A modification of the control tree that governs the multi-threaded parallelization of the BLIS gemm in order to accommodate different loop strides for each type of core architecture.

The strategy is general and can be applied to a generic AMP, consisting of any combination of fast/slow cores sharing the main memory, as well as to all other Level-3 BLAS operations.

To close this section, we emphasize that our work differs from [13, 15] in that we address sophisticated DLA operations, with a rich hierarchy of task dependencies, by leveraging a conventional runtime scheduler in combination with a data-parallel asymmetry-conscious implementation of the BLAS-3.
4. Retargeting Existing Task Schedulers to Asymmetric Architectures
In this section, we initially perform an evaluation of the task-parallel Cholesky routine in Listings 1–2, executed on top of the conventional (i.e., default) scheduler in OmpSs linked to a sequential instance of BLIS, on the target Exynos 5422 SoC. The outcome from this study motivates the development effort and experiments presented in the remainder of the paper.
Figure 3 reports the performance, in terms of GFLOPS (billions of flops per second), attained with the conventional OmpSs runtime, when the number of worker threads varies from 1 to 8, and the mapping of worker threads to cores is delegated to the OS. We evaluated a range of block sizes (b in Listing 1), but for simplicity we report only the results obtained with the value of b that optimized the GFLOPS rate for each problem dimension. All the experiments hereafter employed IEEE double precision. Furthermore, we ensured that the cores operate at the highest possible frequency by setting the appropriate cpufreq governor. The conventional runtime of OmpSs corresponds to release 15.06 of the Nanos++ runtime task scheduler. For this experiment, it is linked with the "sequential" implementation of BLIS in release 0.1.5. (For the experiments with the multi-threaded/asymmetric version of BLIS in the later sections, we will use specialized versions of the codes in [13] for slow+fast VCs.)

The results in the figure reveal the increase in performance as the number of worker threads is raised from 1 to 4, which the OS maps to the (big) Cortex-A15 cores. However, when the number of threads exceeds the number of fast cores, the OS starts binding the threads to the slower Cortex-A7 cores, and the improvement rate is drastically reduced, in some cases even showing a performance drop. This is due to load imbalance, as tasks of uniform granularity, possibly laying on the critical path, are assigned to slow cores.

Stimulated by this first experiment, we recognize that an obvious solution to this problem consists in adapting the runtime task scheduler (more specifically, the scheduling policy) to exploit the SoC asymmetry [15]. Nevertheless, we part ways with this solution, exploring an architecture-aware alternative that leverages a(ny) conventional runtime task scheduler combined with an underlying asymmetric library. We discuss this option in more detail in the following section.
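As a reference for how the GFLOPS figures in Figure 3 and in the remaining experiments are typically obtained, the sketch below converts a measured execution time into a GFLOPS rate using the standard n^3/3 flop count of the Cholesky factorization; the timing helper and the driver routine are illustrative, not the authors' code.

#include <stdio.h>
#include <time.h>

/* Illustrative helper: wall-clock time in seconds. */
static double wall_time(void)
{
  struct timespec ts;
  clock_gettime(CLOCK_MONOTONIC, &ts);
  return (double)ts.tv_sec + 1e-9 * (double)ts.tv_nsec;
}

/* Report the GFLOPS rate of one Cholesky factorization of dimension n,
   assuming the conventional n^3/3 flop count for this operation. */
void report_cholesky_gflops(int n, void (*factorize)(int))
{
  double t0 = wall_time();
  factorize(n);                 /* hypothetical driver for the task-parallel routine */
  double t1 = wall_time();
  double flops = (double)n * (double)n * (double)n / 3.0;
  printf("n = %d: %.3f GFLOPS\n", n, flops / (t1 - t0) / 1e9);
}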
Figure 3: Performance of the Cholesky factorization using the conventional OmpSs runtime and a sequential implementation of BLIS on the Exynos 5422 SoC, for 1 to 8 worker threads.
Our proposal operates under the GTS model but is inspired by CPUM. Concretely, our task scheduler regards the computational resources as four truly symmetric VCs, each composed of a fast and a slow core. For this purpose, unlike CPUM, both physical cores within each VC remain active and collaborate to execute a given task. Furthermore, our approach exploits concurrency at two levels: task-level parallelism is extracted by the runtime in order to schedule tasks to the four symmetric VCs; and each task/kernel is internally divided to expose data-level parallelism, distributing its workload between the two asymmetric physical cores within the VC in charge of its execution.

Our solution thus only requires a conventional (and thus asymmetry-agnostic) runtime task scheduler, e.g. the conventional version of OmpSs, where instead of spawning one worker thread per core in the system, we adhere to the CPUM model, creating only one worker thread per VC. Internally, whenever a ready task is selected to be executed by a worker thread, the corresponding routine from BLIS internally spawns two threads, binds them to the appropriate pair of Cortex-A15+Cortex-A7 cores, and asymmetrically divides the work between the fast and the slow physical cores in the VC. Following this idea, the architecture exposed to the runtime is symmetric, and the kernels in the BLIS library configure a "black box" that abstracts the architecture asymmetry from the runtime scheduler (a sketch of how such a VC can be assembled with standard threading primitives is given below).

In summary, in a conventional setup, the core is the basic computational resource for the task scheduler, and the "sequential" tasks are the minimum work unit to be assigned to these resources. Compared with this, in our approach the VC is the smallest (basic) computational resource from the point of view of the scheduler, and tasks are further divided into smaller units, and executed in parallel by the physical cores inside the VCs.
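The following is a minimal sketch of the idea, under the assumption that each VC pairs one Cortex-A15 with one Cortex-A7 and that the asymmetric split is expressed as a fraction of the iteration space; the core identifiers, the ratio and the kernel interface are illustrative, not the actual BLIS internals.

#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>

/* Hypothetical VC descriptor: one big and one LITTLE core collaborating on a task. */
typedef struct {
  int big_core;      /* e.g., a Cortex-A15 core id (assumed ids 4-7 on this SoC) */
  int little_core;   /* e.g., a Cortex-A7 core id (assumed ids 0-3 on this SoC)  */
  double big_share;  /* fraction of the iteration space assigned to the big core */
} virtual_core;

typedef struct { int begin, end, core; void (*kernel)(int, int); } chunk;

static void *run_chunk(void *arg)
{
  chunk *c = (chunk *)arg;
  cpu_set_t set;
  CPU_ZERO(&set);
  CPU_SET(c->core, &set);                       /* pin this thread to its physical core */
  pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
  c->kernel(c->begin, c->end);                  /* execute its share of the task */
  return NULL;
}

/* Execute one task over iterations [0, n) on a VC, splitting the range asymmetrically. */
void vc_execute(const virtual_core *vc, int n, void (*kernel)(int, int))
{
  int split = (int)(vc->big_share * n);
  chunk big    = { 0,     split, vc->big_core,    kernel };
  chunk little = { split, n,     vc->little_core, kernel };
  pthread_t t;
  pthread_create(&t, NULL, run_chunk, &big);    /* the big core processes the larger share */
  run_chunk(&little);                           /* the calling thread takes the LITTLE share */
  pthread_join(t, NULL);
}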
4.2.2. Comparison with other approaches

Our approach features a number of advantages for the developer:

• The runtime is not aware of asymmetry, and thus a conventional task scheduler will work transparently with no special modifications.

• Any existing scheduling policy (e.g. cache-aware mapping or work stealing) in an asymmetry-agnostic runtime, or any enhancement technique, will directly impact the performance attained on an AMP.

• Any improvement in the asymmetry-aware BLIS implementation will directly impact the performance on an AMP. This applies to different ratios of big/LITTLE cores within a VC, different operating frequencies, or even to the introduction of further levels of asymmetry (e.g. cores with a capability in between fast and slow).

Obviously, there is also a drawback in our proposal, as a tuned asymmetry-aware DLA library must exist in order to reuse conventional runtimes. In the scope of DLA, this drawback is easily tackled with BLIS. We recognize that, in more general domains, an ad-hoc implementation of the application's fundamental kernels becomes mandatory in order to fully exploit the underlying architecture.
We finally note that certain requirements are imposed on a multi-threaded implementation of BLIS that operates under the CPUM mode. To illustrate this, consider the gemm kernel and the high-level description of its implementation in Listing 3. For our objective, we still have to distribute the iteration space between the Cortex-A15 and the Cortex-A7 but, since there is only one resource of each type per VC, there is no need to partition the loops internal to the macro-kernel. Furthermore, we note that the optimal strides for Loop 1 are in practice quite large (nc is in the order of a few thousands for ARM big.LITTLE cores), while the optimal values for Loop 3 are much smaller (mc is 32 for the Cortex-A7 and 156 for the Cortex-A15). Therefore, we target Loop 3 in our data-parallel implementation of BLIS for VCs, which we can expect to easily yield a proper workload balancing.
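To make the Loop 3 split concrete, the sketch below shows one possible way to carve the m dimension into per-core ranges, rounding the boundary to a multiple of the big core's optimal mc (156 for the Cortex-A15, with 32 for the Cortex-A7, as stated above) and steering the larger share to the big core according to an assumed relative-speed ratio; the ratio and the helper are illustrative, not the scheme actually implemented in BLIS.

/* Split the m dimension of Loop 3 between the big and LITTLE core of a VC.
   mc_big/mc_little are the per-core optimal blockings; speed_ratio is an
   assumed big/LITTLE performance ratio for this kernel (illustrative). */
typedef struct { int big_begin, big_end, little_begin, little_end; } loop3_split;

loop3_split split_loop3(int m, int mc_big, int mc_little, double speed_ratio)
{
  /* Fraction of the rows handled by the big core, e.g. ratio 3.0 -> 75%. */
  double big_fraction = speed_ratio / (speed_ratio + 1.0);
  int m_big = (int)(big_fraction * m);

  /* Round the boundary down to a multiple of the big core's blocking, so that
     the big core iterates over whole mc-sized panels where possible. */
  if (mc_big > 0)
    m_big = (m_big / mc_big) * mc_big;
  if (m_big > m) m_big = m;

  loop3_split s = { 0, m_big, m_big, m };
  (void)mc_little;   /* the LITTLE core applies its own stride (e.g. mc = 32) inside its range */
  return s;
}

/* Example: split_loop3(2048, 156, 32, 3.0) assigns the first 1404 rows
   (0.75 * 2048 rounded down to a multiple of 156) to the Cortex-A15 and the
   remaining 644 rows to the Cortex-A7. */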
5. Experimental results
Let us start by reminding that, at execution time, OmpSs decomposes the routine for the Cholesky factorization into a collection of tasks that operate on sub-matrices (blocks) with a granularity dictated by the block size b; see Listing 1. These tasks typically perform invocations to a fundamental kernel of the BLAS-3, in our case provided by BLIS, or LAPACK; see Listing 2.

The first step in our evaluation aims to provide a realistic estimation of the potential performance benefits of our approach (if any) on the target Exynos 5422 SoC. A critical factor from this perspective is the range of block sizes, say b_opt, that are optimal for the conventional OmpSs runtime. In particular, the efficiency of our hybrid task/data-parallel approach is strongly determined by the performance attained with the asymmetric BLIS implementation when compared against that of its sequential counterpart, for problem dimensions n that are in the order of b_opt.

Table 1: Optimal block sizes for the Cholesky factorization using the conventional OmpSs runtime and a sequential implementation of BLIS on the Exynos 5422 SoC.
Problem dimension (n):   512  1,024  1,536  2,048  2,560  3,072  3,584  4,096  4,608  5,120  5,632  6,144  6,656  7,168  7,680
1 worker thread          192    384    320    448    448    448    384    320    320    448    448    448    448    384    448
2 worker threads         192    192    320    192    448    448    384    320    320    448    448    448    448    384    448
3 worker threads         128    192    320    192    384    448    320    320    320    448    448    448    448    384    448
4 worker threads         128    128    192    192    192    320    320    320    320    448    320    448    448    384    448
Table 1 reports the optimal block sizes b_opt for the Cholesky factorization, with problems of increasing matrix dimension, using the conventional OmpSs runtime linked with the sequential BLIS, and 1 to 4 worker threads. Note that, except for the smallest problems, the observed optimal block sizes are between 192 and 448. These dimensions offer a fair compromise, exposing enough task-level parallelism while delivering high "sequential" performance for the execution of each individual task via the sequential implementation of BLIS.

The key insight to take away from this experiment is that, in order to extract good performance from a combination of the conventional OmpSs runtime task scheduler with a multi-threaded asymmetric version of BLIS, the kernels in this instance of the asymmetric library must outperform their sequential counterparts for matrix dimensions in the order of the block sizes in Table 1. Figure 4 shows the performance attained with the three BLAS-3 tasks involved in the Cholesky factorization (gemm, syrk and trsm) for our range of dimensions of interest. There, the multi-threaded asymmetry-aware kernels run concurrently on one Cortex-A15 plus one Cortex-A7 core, while the sequential kernels operate exclusively on a single Cortex-A15 core. In general, the three BLAS-3 routines exhibit a similar trend: the kernels from the sequential BLIS outperform their asymmetric counterparts for small problems (up to approximately m, n, k = 128); but, from that dimension on, the use of the slow core starts paying off. The interesting aspect here is that the cross-over threshold between both performance curves lies within the range of b_opt (usually at an early point of it); see Table 1. This implies that the asymmetric BLIS can potentially improve the performance of the overall computation. Moreover, the gap in performance grows with the problem size before stabilizing.

In order to analyze the actual benefits of our proposal, we next evaluate the conventional OmpSs task scheduler linked with either the sequential implementation of BLIS or its multi-threaded asymmetry-aware version. Hereafter, the BLIS kernels from the first configuration always run using one Cortex-A15 core while, in the second case, they exploit one Cortex-A15 plus one Cortex-A7 core. Figure 5 reports the results for both setups, using an increasing number of worker threads from 1 to 4. For simplicity, we only report the results obtained with the optimal block size. In all cases, the solution based on the multi-threaded asymmetric library outperforms the sequential implementation for relatively large matrices.
Figure 4: Performance of the BLAS-3 kernels (dgemm, dsyrk and dtrsm) in the sequential and the multi-threaded/asymmetric implementations of BLIS, using respectively one Cortex-A15 core and one Cortex-A15 plus one Cortex-A7 core of the Exynos 5422 SoC.
For smaller problems, the GFLOPS rates of both configurations are similar. The reason for this behavior can be derived from the optimal block sizes reported in Table 1 and the performance of BLIS reported in Figure 4: for that range of problem dimensions, the optimal block size is significantly smaller, and both BLIS implementations attain similar performance rates.

The quantitative difference in performance between both approaches is reported in Tables 2 and 3. The first table illustrates the raw (i.e., absolute) gap, while the second one shows the difference per Cortex-A7 core introduced in the experiment. Let us consider, for example, the problem size n = 6,144. In that case, the performance roughly improves by 0.975 GFLOPS when the 4 slow cores are added to help the base 4 Cortex-A15 cores. This translates into a performance gain of 0.243 GFLOPS per slow core, which is slightly under the improvement that could be expected from the experiments in the previous section. Note, however, that the performance per Cortex-A7 core is reduced from 0.340 GFLOPS, when adding just one core, to 0.243 GFLOPS, when simultaneously using all four slow cores.

Our last round of experiments aims to assess the performance advantages of different task-parallel executions of the Cholesky factorization via OmpSs.
Figure 5: Performance of the Cholesky factorization using the conventional OmpSs runtime linked with either the sequential or the multi-threaded/asymmetric implementations of BLIS on the Exynos 5422 SoC, for 1 to 4 worker threads.

Concretely, we consider: (1) the conventional task scheduler linked with the sequential BLIS ("OmpSs - Seq. BLIS"); (2) the conventional task scheduler linked with our multi-threaded asymmetric BLIS that views the SoC as four symmetric virtual cores ("OmpSs - Asym. BLIS"); and (3) the criticality-aware task scheduler in Botlev-OmpSs linked with the sequential BLIS ("Botlev-OmpSs - Seq. BLIS"). In the executions, we use all four Cortex-A15 cores and evaluate the impact of adding an increasing number of Cortex-A7 cores, from 1 to 4, for Botlev-OmpSs.

Table 2: Absolute performance improvement (in GFLOPS) for the Cholesky factorization using the conventional OmpSs runtime linked with the multi-threaded/asymmetric BLIS with respect to the same runtime linked with the sequential BLIS on the Exynos 5422 SoC.

Problem dimension (n):     512   1,024   2,048   2,560   3,072   4,096   4,608   5,120   5,632   6,144   6,656   7,680
1 worker thread         -0.143   0.061   0.218   0.289   0.326   0.267   0.259   0.313   0.324   0.340   0.348   0.300
2 worker threads        -0.116  -0.109   0.213   0.469   0.573   0.495   0.454   0.568   0.588   0.617   0.660   0.582
3 worker threads        -0.308  -0.233  -0.020   0.432   0.720   0.614   0.603   0.800   0.820   0.866   0.825   0.780
4 worker threads        -0.421  -0.440  -0.274   0.204   0.227   0.614   0.506   0.769   0.666   0.975   0.829   0.902

Table 3: Performance improvement per slow core (in GFLOPS) for the Cholesky factorization using the conventional OmpSs runtime linked with the multi-threaded/asymmetric BLIS with respect to the same runtime linked with the sequential BLIS on the Exynos 5422 SoC.

Problem dimension (n):     512   1,024   2,048   2,560   3,072   4,096   4,608   5,120   5,632   6,144   6,656   7,680
1 worker thread         -0.143   0.061   0.218   0.289   0.326   0.267   0.259   0.313   0.324   0.340   0.348   0.300
2 worker threads        -0.058  -0.054   0.106   0.234   0.286   0.247   0.227   0.284   0.294   0.308   0.330   0.291
3 worker threads        -0.102  -0.077  -0.006   0.144   0.240   0.204   0.201   0.266   0.273   0.288   0.275   0.261
4 worker threads        -0.105  -0.110  -0.068   0.051   0.056   0.153   0.126   0.192   0.166   0.243   0.207   0.225

Figure 6 shows the performance attained by the aforementioned alternatives on the Exynos 5422 SoC. The results can be divided into three groups along the problem dimension:

• For small matrices (n = 512, 1,024), the conventional runtime using exclusively the four big cores (that is, linked with a sequential BLIS library for task execution) attains the best results in terms of performance. This was expected and was already observed in Figure 5; the main reason is the small optimal block size, enforced by the reduced problem size, that is necessary in order to expose enough task-level parallelism. This invalidates the use of our asymmetric BLIS implementation due to its low performance for very small matrices; see Figure 4. We note that the ad-hoc Botlev-OmpSs does not attain remarkable performance either for this dimension range, regardless of the number of Cortex-A7 cores used.

• For medium-sized matrices (n = 2,048, 4,096), the gap in performance between the different approaches is reduced. The variant that relies on the asymmetric BLIS implementation commences to outperform the alternative implementations for n = 4,096 by a short margin. For this problem range, Botlev-OmpSs is competitive, and also outperforms the conventional setup.

• For large matrices (n = 6,144, 7,680) this trend is consolidated, and both asymmetry-aware approaches deliver remarkable performance gains with respect to the conventional setup. Comparing both asymmetry-aware solutions, our mechanism attains a better performance rate, even when considering the usage of all available cores for the Botlev-OmpSs runtime version.

To summarize, our proposal to exploit asymmetry improves portability and programmability by avoiding a revamp of the runtime task scheduler for AMPs. In addition, our approach renders performance gains which are, for all problem cases, comparable with those of ad-hoc asymmetry-conscious schedulers; for medium to large matrices, it clearly outperforms the efficiency attained with a conventional asymmetry-oblivious scheduler.

We next provide further details on the performance behavior of each one of the aforementioned runtime configurations.
The execution traces in this section have all been extracted with the Extrae instrumentation tool and analyzed with the visualization package
Paraver [25]. The results correspond to the Cholesky factorization of a single problem with matrix dimension n = 6,144 and block size b = 448.
Figure 6: Performance (in GFLOPS) for the Cholesky factorization using the conventional OmpSs runtime linked with either the sequential BLIS or the multi-threaded/asymmetric BLIS, and the ad-hoc asymmetry-aware version of the OmpSs runtime (Botlev-OmpSs) linked with the sequential BLIS on the Exynos 5422 SoC. The labels of the form "4+x" refer to an execution with 4 Cortex-A15 cores and x Cortex-A7 cores.
Figure 7 reports complete execution traces for each runtime configuration.

Figure 7: Execution traces for the Cholesky factorization (n = 6,144, b = 448): (a) OmpSs with sequential BLIS (8 worker threads); (b) OmpSs with sequential BLIS (4 worker threads); (c) OmpSs with asymmetric BLIS (4 worker threads); (d) Botlev-OmpSs with sequential BLIS (8 worker threads, 4+4). The timeline in each row collects the tasks executed by a single worker thread. Tasks are colored following the convention in Figure 2; phases colored in white between tasks represent idle times. The green flags mark task initialization/completion.

At a glance, a number of coarse remarks can be extracted from the traces:

• From the perspective of total execution time (i.e., time-to-solution), the conventional OmpSs runtime combined with the asymmetric BLIS implementation attains the best results, followed by the Botlev-OmpSs runtime configuration. It is worth pointing out that an asymmetry-oblivious runtime which spawns 8 worker threads, with no further considerations, yields the worst performance by far. In this case, the load imbalance and long idle periods, especially as the amount of concurrency decreases in the final part of the trace, entail a huge performance penalty.

• The flag marks indicating task initialization/completion reveal that the asymmetric BLIS implementation (which employs the combined resources of a VC) requires less time per task than the two alternatives based on a sequential BLIS. An effect to note specifically in the Botlev-OmpSs configuration is the difference in performance between tasks of the same type, when executed by a big core (worker threads 5 to 8) or a LITTLE one (worker threads 1 to 4).

• The Botlev-OmpSs task scheduler embeds a (complex) scheduling strategy that includes priorities, advancing the execution of tasks in the critical path and, whenever possible, assigning them to fast cores (see, for example, the tasks for the factorization of diagonal blocks, colored in yellow). This yields an execution timeline that is more compact during the first stages of the parallel execution, at the cost of longer idle times when the degree of concurrency decreases (last iterations of the factorization). Although possible, a priority-aware technique has not been applied in our experiments with the conventional OmpSs setups and remains part of future work.

We next provide a quantitative analysis of the task durations and a more detailed study of the scheduling strategy integrated in each configuration.

Table 4 reports the average execution time per type of task for each worker thread. These results show that the execution time per individual type of task is considerably shorter for our multi-threaded/asymmetric BLIS implementation than for the alternatives based on a sequential version of BLIS. The only exception is the factorization of the diagonal block (dpotrf), as this is a LAPACK-level routine, and therefore it is not available in BLIS.

Table 4: Average time (in ms) per task and worker thread in the Cholesky factorization (n = 6,144, b = 448), for the three runtime configurations.
           OmpSs - Seq. BLIS              OmpSs - Asym. BLIS             Botlev-OmpSs - Seq. BLIS
           (4 worker threads)             (4 worker threads)             (8 worker threads, 4+4)
        dgemm  dtrsm  dsyrk  dpotrf    dgemm  dtrsm  dsyrk  dpotrf    dgemm  dtrsm  dsyrk  dpotrf
wt 4      –      –      –      –         –      –      –      –      90.97  48.97  48.36    –
wt 5      –      –      –      –         –      –      –      –      90.61  48.86  48.16  90.78
wt 6      –      –      –      –         –      –      –      –      91.28  49.43  47.97  89.58
wt 7      –      –      –      –         –      –      –      –      91.60  49.49  48.62  95.43
Inspecting the task execution times of the Botlev-OmpSs configuration, we observe a remarkable difference depending on the type of core the tasks are mapped to. For example, the average execution times for dgemm range from more than 400 ms on a LITTLE core to roughly 90 ms on a big core. This behavior is reproduced for all types of tasks.
Figure 8 illustrates the task execution order determined by the Nanos++ task scheduler. Here tasks are depicted using a color gradient, attending to the order in which they are encountered in the sequential code, from the earliest to the latest.

Figure 8: Task execution order for the Cholesky factorization (n = 6,144, b = 448): (a) OmpSs with sequential BLIS (8 worker threads); (b) OmpSs with sequential BLIS (4 worker threads); (c) OmpSs with asymmetric BLIS (4 worker threads); (d) Botlev-OmpSs with sequential BLIS (8 worker threads, 4+4). In the trace, tasks are ordered according to their appearance in the sequential code, and depicted using a color gradient, with light green indicating early tasks, and dark blue for the late tasks.

At runtime, the task scheduler in Botlev-OmpSs issues tasks for execution out-of-order depending on their criticality. The main idea behind this scheduling policy is to track the criticality of each task and, when possible, advance the execution of critical tasks by assigning them to the fast Cortex-A15 cores. Conformally with this strategy, an out-of-order execution reveals itself more frequently in the timelines for the big cores than in those for the LITTLE cores. With the conventional runtime, the out-of-order execution is only dictated by the order in which the data dependencies for the tasks are satisfied.

From the execution traces, we can observe that the Botlev-OmpSs alternative suffers a remarkable performance penalty due to the existence of idle periods in the final part of the factorization, when the concurrency in the factorization is greatly diminished. This problem is not present in the conventional scheduling policies. In the first stages of the factorization, however, the use of a priority-aware policy in the Botlev-OmpSs scheduler effectively reduces idle times. Table 5 reports the percentage of time each worker thread spends in running or idle state. In general, the relative amount of time spent in idle state is much higher for Botlev-OmpSs than for the conventional implementations (17% vs. 5%, respectively). Note also the remarkable difference in the percentage of idle time between the big and LITTLE cores (20% and 13%, respectively), which drives to the conclusion that the fast cores stall waiting for the completion of tasks executed on the LITTLE cores. This fact can be confirmed in the final stages of the Botlev-OmpSs trace.

Table 5: Percentage of time per worker thread in idle or running state for the different runtime configurations for the Cholesky factorization (n = 6,144, b = 448). Note that wt 0 is the master thread, and thus is never idle; for it, the rest of the time up to 100% is devoted to synchronization, scheduling and thread creation. For the rest of the threads, this amount of time is devoted to runtime overhead.

           OmpSs - Seq. BLIS       OmpSs - Asym. BLIS      Botlev-OmpSs - Seq. BLIS
           (4 worker threads)      (4 worker threads)      (8 worker threads, 4+4)
            idle    running         idle    running         idle    running
wt 0          –      98.41            –      97.85            –      86.53
wt 4          –        –              –        –            19.26    80.51
wt 5          –        –              –        –            21.12    78.69
wt 6          –        –              –        –            20.84    78.97
wt 7          –        –              –        –            20.09    79.70

The previous observations pave the road to a combination of scheduling policy and execution model for AMPs, in which asymmetry is exploited through ad-hoc scheduling policies during the first stages of the factorization, when the potential parallelism is massive, and this is later replaced with the use of asymmetry-aware kernels and coarse-grain VCs in the final stages of the execution, when the concurrency is scarce. Both approaches are not mutually exclusive, but complementary depending on the level of task concurrency available at a given execution stage.
6. Conclusions
In this paper, we have addressed the problem of refactoring existing runtime task schedulers to exploit task-level parallelism on novel AMPs, focusing on ARM big.LITTLE systems-on-chip. We have demonstrated that, for the specific domain of DLA, an approach that delegates the burden of dealing with asymmetry to the library (in our case, using an asymmetry-aware BLIS implementation) does not require any revamp of existing task schedulers, and can deliver high performance. This proposal paves the road towards reusing conventional runtime schedulers for SMPs (and all the associated improvement techniques developed through the past few years), as the runtime only has a symmetric view of the hardware.
Our experiments reveal that this solution is competitive and even improves the results obtained with an asymmetry-aware scheduler for DLA operations.
Acknowledgments
The researchers from Universidad Complutense de Madrid were supported by project CICYT TIN2012-32180. Enrique S. Quintana-Ortí was supported by projects CICYT TIN2011-23283 and TIN2014-53495-R as well as the EU project FP7 318793 "EXA2GREEN".
References

[1] H. Esmaeilzadeh, E. Blem, R. St. Amant, K. Sankaralingam, D. Burger, Dark silicon and the end of multicore scaling, in: Proc. 38th Annual Int. Symp. on Computer Architecture, ISCA'11, 2011, pp. 365–376.
[2] M. Duranton, D. Black-Schaffer, K. D. Bosschere, J. Maebe, The HiPEAC vision for advanced computing in Horizon 2020, (2013).
[3] Cilk project home page, http://supertech.csail.mit.edu/cilk/, last visit: July 2015.
[4] OmpSs project home page, http://pm.bsc.es/ompss, last visit: July 2015.
[5] StarPU project home page, http://runtime.bordeaux.inria.fr/StarPU/, last visit: July 2015.
[6] PLASMA project home page, http://icl.cs.utk.edu/plasma/, last visit: July 2015.
[7] MAGMA project home page, http://icl.cs.utk.edu/magma/, last visit: July 2015.
[8] Kaapi project home page, https://gforge.inria.fr/projects/kaapi, last visit: July 2015.
[9] FLAME project home page, last visit: July 2015.
[10] G. H. Golub, C. F. V. Loan, Matrix Computations, 3rd Edition, The Johns Hopkins University Press, Baltimore, 1996.
[11] E. Anderson et al., LAPACK Users' Guide, 3rd Edition, SIAM, 1999.
[12] J. J. Dongarra, J. Du Croz, S. Hammarling, I. Duff, A set of level 3 basic linear algebra subprograms, ACM Transactions on Mathematical Software 16 (1) (1990) 1–17.
[13] S. Catalán, F. D. Igual, R. Mayo, R. Rodríguez-Sánchez, E. S. Quintana-Ortí, Architecture-aware configuration and scheduling of matrix multiplication on asymmetric multicore processors, ArXiv e-prints 1506.08988. URL http://arxiv.org/abs/1506.08988
[14] F. G. Van Zee, R. A. van de Geijn, BLIS: A framework for rapidly instantiating BLAS functionality, ACM Transactions on Mathematical Software 41 (3) (2015) 14:1–14:33. URL http://doi.acm.org/10.1145/2764454
[15] K. Chronaki, A. Rico, R. M. Badia, E. Ayguadé, J. Labarta, M. Valero, Criticality-aware dynamic task scheduling for heterogeneous architectures, in: Proceedings of ICS'15, 2015.
[16] G. Quintana-Ortí, E. S. Quintana-Ortí, R. A. van de Geijn, F. G. Van Zee, E. Chan, Programming matrix algorithms-by-blocks for thread-level parallelism, ACM Transactions on Mathematical Software 36 (3) (2009) 14:1–14:26.
[17] R. M. Badia, J. R. Herrero, J. Labarta, J. M. Pérez, E. S. Quintana-Ortí, G. Quintana-Ortí, Parallelizing dense and banded linear algebra libraries using SMPSs, Concurrency and Computation: Practice and Experience 21 (18) (2009) 2438–2456.
[18] C. Augonnet, S. Thibault, R. Namyst, P.-A. Wacrenier, StarPU: A unified platform for task scheduling on heterogeneous multicore architectures, Concurrency and Computation: Practice and Experience 23 (2) (2011) 187–198.
[19] S. Tomov, R. Nath, H. Ltaief, J. Dongarra, Dense linear algebra solvers for multicore with GPU accelerators, in: Proc. IEEE Symp. on Parallel and Distributed Processing, Workshops and PhD Forum (IPDPSW), 2010, pp. 1–8.
[20] T. Gautier, J. V. F. Lima, N. Maillard, B. Raffin, XKaapi: A runtime system for data-flow task programming on heterogeneous architectures, in: Proc. IEEE 27th Int. Symp. on Parallel and Distributed Processing, IPDPS'13, 2013, pp. 1299–1308.
[21] K. Goto, R. A. van de Geijn, Anatomy of a high-performance matrix multiplication, ACM Transactions on Mathematical Software 34 (3) (2008) 12:1–12:25. URL http://doi.acm.org/10.1145/1356052.1356053
[22] OpenBLAS, http://xianyi.github.com/OpenBLAS/, last visit: July 2015.
[23] F. G. Van Zee, T. M. Smith, B. Marker, T. M. Low, R. A. van de Geijn, F. D. Igual, M. Smelyanskiy, X. Zhang, M. Kistler, V. Austel, J. Gunnels, L. Killough, The BLIS framework: Experiments in portability, ACM Trans. Math. Soft. Accepted.
[24] T. M. Smith, R. van de Geijn, M. Smelyanskiy, J. R. Hammond, F. G. Van Zee, Anatomy of high-performance many-threaded matrix multiplication, in: Proc. IEEE 28th Int. Symp. on Parallel and Distributed Processing, IPDPS'14, 2014, pp. 1049–1059.
[25] Paraver: the flexible analysis tool.