Architecture-Aware Configuration and Scheduling of Matrix Multiplication on Asymmetric Multicore Processors
Sandra Catalán (a), Francisco D. Igual (b), Rafael Mayo (a), Rafael Rodríguez-Sánchez (a), Enrique S. Quintana-Ortí (a)
(a) Depto. de Ingeniería y Ciencia de Computadores, Universidad Jaume I, Castellón, Spain.
(b) Depto. de Arquitectura de Computadores y Automática, Universidad Complutense de Madrid, Spain.
Abstract
Asymmetric multicore processors (AMPs) have recently emerged as an appealing technology for severely energy-constrained environments, especially in mobile appliances where heterogeneity in applications is mainstream. In addition, given the growing interest in low-power high performance computing, this type of architecture is also being investigated as a means to improve the throughput-per-Watt of complex scientific applications. In this paper, we design and embed several architecture-aware optimizations into a multi-threaded general matrix multiplication (gemm), a key operation of the BLAS, in order to obtain a high performance implementation for ARM big.LITTLE AMPs. Our solution is based on the reference implementation of gemm in the BLIS library, and integrates a cache-aware configuration as well as asymmetric-static and dynamic scheduling strategies that carefully tune and distribute the operation's micro-kernels among the big and LITTLE cores of the target processor. The experimental results on a Samsung Exynos 5422, a system-on-chip with ARM Cortex-A15 and Cortex-A7 clusters that implements the big.LITTLE model, expose that our cache-aware versions of gemm with asymmetric scheduling attain important gains in performance with respect to their architecture-oblivious counterparts, while exploiting all the resources of the AMP to deliver considerable energy efficiency.

Keywords: Matrix multiplication, asymmetric multicore processors, memory hierarchy, scheduling, multi-threading, high performance computing

Email addresses: [email protected] (Sandra Catalán), [email protected] (Francisco D. Igual), [email protected] (Rafael Mayo), [email protected] (Rafael Rodríguez-Sánchez), [email protected] (Enrique S. Quintana-Ortí)
1. Introduction
The decay of Dennard scaling [1] during the past decade marked the end of the "GHz race" and the shift towards multicore designs due to their more favorable performance-power ratio. In addition, the doubling of transistors on chip with each new semiconductor generation, dictated by Moore's law [2], has only exacerbated the power wall problem [3, 4, 5], leading to the rise of "dark silicon" [6] and the deployment of heterogeneous facilities for high performance computing [7, 8].

Asymmetric multicore processors (AMPs) are a particular class of heterogeneous architectures equipped with cores that share the same instruction set architecture but differ in micro-architecture, and thus in complexity, performance, and power consumption. (According to this definition, servers equipped with one or more general-purpose multicore processors and a PCIe-attached graphics accelerator, or systems-on-chip like the NVIDIA Tegra TK1, are excluded from this category.) AMPs have received considerable attention in recent years as a means to improve the performance-power ratio of computing systems [9, 10, 11, 12], partly because, in theory, they can deliver much higher performance for the same power budget, mainly by exploiting the presence of serial and parallel phases within applications [11].

The general matrix multiplication (gemm) is a crucial operation for the optimization of the Level-3 Basic Linear Algebra Subprograms (BLAS) [13], as portable and highly tuned versions of the remaining Level-3 kernels are in general built on top of gemm [14]. In turn, the contents of BLAS conform a pivotal cornerstone upon which many sophisticated libraries to tackle complex scientific and engineering applications rely [15]. The importance of BLAS in general, and gemm in particular, is illustrated by the prolonged efforts spent over the past decades to produce carefully tuned commercial libraries for almost any current architecture (e.g., Intel's MKL [16], AMD's ACML [17]).

In this paper, we address the optimization of gemm on an ARM big.LITTLE AMP consisting of a cluster composed of a few fast (big) cores and a complementary cluster with several slow (LITTLE) cores, shared main memory, and private L1/L2 caches per core/cluster, respectively. Our approach leverages the multi-threaded implementation of gemm in the BLIS library, which decomposes the operation into a collection of nested loops around a micro-kernel. In this reference code, we modify the loop stride configuration and scheduling to distribute the micro-kernels comprised by certain loops among the big/LITTLE clusters and cores while taking into account the processor's computational power and cache organization. In more detail, this work makes the following specific contributions:

• Our optimized implementations modify the control tree structure that governs the multi-threaded parallelization of BLIS gemm in order to accommodate cache-aware configurations of the loop strides for each type of core architecture that match the organization of its cache hierarchy.

• We integrate two alternative scheduling strategies, asymmetric-static and dynamic, to produce a 1-D partitioning of (the iteration space for) one of the outer loops of BLIS gemm between the two clusters that yields a balanced distribution of the micro-kernels. Furthermore, we apply an orthogonal symmetric-static schedule to map the workload of one of the inner loops across the cores of the same cluster.
• We demonstrate the practical benefits of the cache-aware configurations and asymmetry-aware scheduling strategies on the Exynos 5422, a system-on-chip (SoC) consisting of an ARM Cortex-A15 quad-core (big) cluster and an ARM Cortex-A7 quad-core (LITTLE) cluster. Our experimental results show that the performance attained by the optimized gemm on this platform is well beyond that of an architecture-oblivious multi-threaded implementation and close to that of an ideal scenario.

• We include an analysis of the energy efficiency of the asymmetric architecture when running our optimized gemm, using the GFLOPS/W (billions of floating-point arithmetic operations, or flops, per second and Watt) metric, which assesses the energy cost of the flops.

To conclude, we emphasize that the scheduling approaches proposed in this paper are general and, in combination with the BLIS implementation of gemm, can be ported with little effort to present and future instances of the ARM big.LITTLE architecture as well as to any other asymmetric design in general (e.g., the Intel QuickIA prototype [25]). Furthermore, we are confident that the principles underlying our scheduling decisions carry over to all Level-3 BLAS operations.

The rest of the paper is structured as follows. In Section 2, we compare our approach to optimize gemm on AMPs with state-of-the-art works on similar architectures. In Section 3, we describe the mechanisms that underlie the original multi-threaded implementation of gemm in the BLIS framework, and evaluate its performance and optimal cache parameter configuration for the Cortex-A15 and Cortex-A7 clusters. In Section 4, we investigate the effect of using standard, architecture-oblivious multi-threaded BLAS implementations on AMPs, and its negative impact on performance and energy efficiency. In Section 5, we introduce our strategies to adapt the original BLIS multi-threaded implementation to the asymmetric architecture, and report the performance and energy-efficiency results of the new codes. Finally, Section 6 closes the paper with a few concluding remarks and proposals for future work.
2. Related Work
Heterogeneous (and asymmetric) architectures are an active research topic, with a vast design space that needs careful consideration in terms of power, performance, programmability, and flexibility [26]. Many of these works can be grouped into i) efforts to experimentally evaluate the computational performance and/or power-energy efficiency of AMPs using multi-threaded benchmarks and applications; and ii) contributions related to workload-partitioning strategies for the execution of gemm on heterogeneous platforms.

In the first group, Winter et al. [12] discuss power management techniques and thread scheduling for AMPs, and scheduling on AMP architectures is explored in a number of works; see, among others, [27, 28, 29, 12] and the references therein. In the second group, mapping gemm onto a heterogeneous cluster is analyzed in [30], while a theoretical study of dynamic scheduling applied to gemm in a similar scenario is introduced in [31].

Compared with previous work, our investigation aims to deliver an implementation of gemm, based on an open source BLAS library (BLIS), that is highly optimized for asymmetric ARM big.LITTLE architectures. All previous efforts to implement and evaluate gemm on AMPs employ simple codes, at best tuned via very basic tiling techniques, and therefore yield suboptimal codes. The research with heterogeneous clusters targets a more general and complex problem, and in practice can hardly be expected to produce an optimal solution for AMPs.
3. Multi-Threaded Portable Implementation of BLIS gemm
Modern high-performance implementations of gemm for general-purpose architectures follow the design pioneered by GotoBLAS [20]. BLIS in particular implements the gemm C += A · B, where the sizes of A, B, C are respectively m × k, k × n, m × n, as three nested loops around a macro-kernel plus two packing routines (see Loops 1–3 in Figure 1). The macro-kernel is then implemented in terms of two additional loops around a micro-kernel (Loops 4 and 5 in Figure 1). In BLIS, the micro-kernel is typically encoded as a loop around a rank-1 (i.e., outer product) update using assembly or vector intrinsics, while the remaining five loops and packing routines are implemented in C; see [23] for further details.

  Loop 1:  for jc = 0, ..., n-1 in steps of nc
  Loop 2:    for pc = 0, ..., k-1 in steps of kc
               B(pc : pc+kc-1, jc : jc+nc-1) → Bc                  // Pack into Bc
  Loop 3:      for ic = 0, ..., m-1 in steps of mc
                 A(ic : ic+mc-1, pc : pc+kc-1) → Ac                // Pack into Ac
  Loop 4:        for jr = 0, ..., nc-1 in steps of nr              // Macro-kernel
  Loop 5:          for ir = 0, ..., mc-1 in steps of mr
                     Cc(ir : ir+mr-1, jr : jr+nr-1)                // Micro-kernel
                       += Ac(ir : ir+mr-1, 0 : kc-1) · Bc(0 : kc-1, jr : jr+nr-1)
                   endfor
                 endfor
               endfor
             endfor
           endfor

Figure 1: High performance implementation of gemm in BLIS. In the code, Cc ≡ C(ic : ic+mc-1, jc : jc+nc-1) is just a notation artifact, introduced to ease the presentation of the algorithm, while Ac, Bc correspond to actual buffers that are involved in data copies.

Figure 2 illustrates how the loop ordering, together with the packing routines and an appropriate choice of the BLIS cache configuration parameters, orchestrates a regular pattern of data transfers through the levels of the memory hierarchy. In practice, the cache parameters nc, kc, mc, nr and mr (which dictate the strides of the five outermost loops) are adjusted taking into account the latency of the floating-point units (FPUs), the number of vector registers, and the size/associativity degree of the cache levels. The goal is that a kc × nr micro-panel of Bc, say Br, and the mc × kc macro-panel Ac are streamed into the FPUs from the L1 and L2 caches, respectively, while the kc × nc macro-panel Bc resides in the L3 cache (if present). By appropriately choosing the configuration parameters, these transfers are fully amortized with enough computation from within the micro-kernel; see [32].

Figure 2: Data movement involved in the BLIS implementation of gemm.
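For concreteness, the following C sketch mirrors the five loops and the two packing stages of Figure 1 for double-precision, column-major operands. It is only an illustration of the algorithm under simplifying assumptions: the block sizes are arbitrary placeholders, the buffers are packed as plain column-major copies rather than the micro-panel layout used by BLIS, and the vectorized micro-kernel is replaced by a scalar loop of rank-1 updates.

  /* Simplified sketch of the blocked gemm of Figure 1: C += A * B, with
   * column-major matrices. MC, KC, NC, MR, NR stand in for mc, kc, nc, mr, nr. */
  #include <stdlib.h>

  #define MC 152
  #define KC 952
  #define NC 4096
  #define MR 4
  #define NR 4

  static void pack_A(int mc, int kc, const double *A, int lda, double *Ac) {
    for (int j = 0; j < kc; j++)        /* copy an mc x kc block of A into Ac */
      for (int i = 0; i < mc; i++)
        Ac[i + j * mc] = A[i + j * lda];
  }

  static void pack_B(int kc, int nc, const double *B, int ldb, double *Bc) {
    for (int j = 0; j < nc; j++)        /* copy a kc x nc block of B into Bc */
      for (int i = 0; i < kc; i++)
        Bc[i + j * kc] = B[i + j * ldb];
  }

  void gemm_blocked(int m, int n, int k, const double *A, int lda,
                    const double *B, int ldb, double *C, int ldc) {
    double *Ac = malloc((size_t)MC * KC * sizeof(double));
    double *Bc = malloc((size_t)KC * NC * sizeof(double));
    for (int jc = 0; jc < n; jc += NC) {                        /* Loop 1 */
      int nc = (n - jc < NC) ? n - jc : NC;
      for (int pc = 0; pc < k; pc += KC) {                      /* Loop 2 */
        int kc = (k - pc < KC) ? k - pc : KC;
        pack_B(kc, nc, &B[pc + (size_t)jc * ldb], ldb, Bc);     /* Pack into Bc */
        for (int ic = 0; ic < m; ic += MC) {                    /* Loop 3 */
          int mc = (m - ic < MC) ? m - ic : MC;
          pack_A(mc, kc, &A[ic + (size_t)pc * lda], lda, Ac);   /* Pack into Ac */
          for (int jr = 0; jr < nc; jr += NR) {                 /* Loop 4: macro-kernel */
            int nr = (nc - jr < NR) ? nc - jr : NR;
            for (int ir = 0; ir < mc; ir += MR) {               /* Loop 5 */
              int mr = (mc - ir < MR) ? mc - ir : MR;
              /* Scalar stand-in for the micro-kernel: kc rank-1 updates of an
               * mr x nr block of C. */
              for (int p = 0; p < kc; p++)
                for (int j = 0; j < nr; j++)
                  for (int i = 0; i < mr; i++)
                    C[(ic + ir + i) + (size_t)(jc + jr + j) * ldc] +=
                        Ac[(ir + i) + p * mc] * Bc[p + (jr + j) * kc];
            }
          }
        }
      }
    }
    free(Ac);
    free(Bc);
  }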
3.1. Multi-threaded gemm in BLIS

BLIS allows one to select, at execution time, which of the five loops of gemm are parallelized. Several loops can be simultaneously executed in parallel in order to adapt the execution to specific properties of the target architecture. By default, when one of the loops is parallelized, a static partitioning and mapping of iteration chunks to threads is performed prior to the execution of the loop.

The multi-threaded version of gemm integrated in BLIS has been previously analyzed for conventional symmetric multicore processors (SMPs) [33] and modern many-threaded architectures [34]. In both types of architectures, the parallel implementations exploit the concurrency available in the nested five-loop organization of gemm at one or more levels (i.e., loops). Furthermore, the approach takes into account the cache organization of the target platform (e.g., the presence of multiple sockets, which cache levels are shared/private, etc.), while discarding the parallelization of loops that would incur race conditions as well as options that exhibit too fine a granularity. The insights gained from these analyses [33, 34] about the loop(s) to parallelize in a multi-threaded implementation of gemm can be summarized as follows:

• Loop 5 (indexed by ir). With this option, different threads execute independent instances of the micro-kernel, while accessing the same micro-panel Br in the L1 cache. The amount of parallelism in this case, ⌈mc/mr⌉, is scarce as, for many architectures, the optimal value for mc is in the order of a few hundreds.
• Loop 4 (indexed by jr). Different threads operate on independent instances of the micro-kernel, but access the same macro-panel Ac in the L2 cache. The time spent in this loop amortizes the cost of packing (and, therefore, moving) Ac from main memory into the L2 cache. The amount of parallelism, ⌈nc/nr⌉, is in general larger than in the previous case, as nc is in the order of several hundreds up to a few thousands for many architectures.

• Loop 3 (indexed by ic). Each thread packs a different macro-panel Ac into the L2 cache and executes a different instance of the macro-kernel. The number of iterations of this loop is not constrained by the cache parameters, but instead depends on the problem dimension m. When m is less than the product of mc and the degree of parallelization of the loop, Ac will be smaller than the optimal dimension and performance may suffer. When there is a shared L2 cache, the size of Ac will have to be reduced by a factor equal to the degree of parallelization of this loop. However, reducing mc is equivalent to parallelizing the first loop around the micro-kernel.

• Loop 2 (indexed by pc). This is not a good choice because multiple threads simultaneously update the same parts of C, requiring a mechanism to prevent race conditions.

• Loop 1 (indexed by jc). From a data-sharing perspective, this option is equivalent to extracting the parallelism outside of BLIS. This parallelization is reasonable in a multi-socket system where each CPU (socket) has a separate L3 cache.

To sum up, these are general guidelines to decide which loops are theoretically good candidates to be parallelized in order to fully exploit the cache hierarchy of a target architecture. At a glance, the appropriate combination of loops to parallelize strongly depends on which caches are private or shared. Usually, Loop 1 is a good candidate in a multi-socket platform with on-chip L3 caches; Loop 3 should be parallelized when each core has its own L2 cache; and Loops 4 and 5 are convenient choices if the cores share the L2 cache.

3.2. Experimental setup

The ODROID-XU3 board employed in our experiments contains a Samsung Exynos 5422 SoC with an ARM Cortex-A15 quad-core processing cluster (running at 1.6 GHz in our setup) and a Cortex-A7 quad-core processing cluster (running at 1.4 GHz). Both clusters access a shared DDR3 RAM (2 Gbytes) via 128-bit coherent bus interfaces. Each ARM core (either Cortex-A15 or Cortex-A7) has a 32+32-Kbyte L1 (instruction+data) cache. The four ARM Cortex-A15 cores share a 2-Mbyte L2 cache, while the four ARM Cortex-A7 cores share a smaller 512-Kbyte L2 cache; see Figure 3.
Figure 3: Exynos 5422 block diagram.

All our tests hereafter employ ieee double-precision arithmetic and square matrices of order r = m = n = k. We ensure that the cores run at their highest frequency by setting the Linux performance governor with the appropriate frequency limits. Codes are instrumented with the pmlib [35] framework, which collects power consumption data corresponding to instantaneous power readings from four independent sensors in the board (for the Cortex-A7 cores, Cortex-A15 cores, DRAM and GPU), with a sampling rate of 250 ms.

3.3. Optimal cache configuration parameters

An initial step in order to attain high performance with the implementation of BLIS gemm is, given a target precision (single or double), to determine the configuration parameters nc, kc, mc, nr, and mr for a single ARM core of each type, Cortex-A15 and Cortex-A7, that fit the cache organization. We next describe our experimental effort towards this goal. A recent study [36] shows that, in principle, this optimization is also possible via analytic derivation.

The first aspect to note is that, in this architecture, nc plays a minor role and, therefore, can be simply set to nc = 4,096, as nc is connected to the dimension of the L3 cache, which is not present in the Exynos 5422 SoC. Furthermore, the micro-kernels for these core architectures and precision are thoroughly tuned with mr = 4 and nr = 4. In consequence, the optimization of gemm in a single-core scenario boils down to determining the optimal values of mc and kc for each type of core. For this purpose, we performed independent empirical searches using a single Cortex-A15 core and a single Cortex-A7 core.
Fine−grain refinement in Cortex−A7 kc m c Figure 4: BLIS optimal cache configuration parameters m c and k c for the ARM Cortex-A15 (left) and Cortex-A7 (right) in the Samsung Exynos 5422 SoC. The performanceranges from red (lowest GFLOPS) to green (highest GFLOPS); the optimal ( m c , k c ) pairis marked as a blue dot. a single Cortex-A15 core and a single Cortex-A7 core. In both cases, weinitially applied a coarse-grain search to detect potential optimal regions,and the selected regions were further explored next with a finer granularityto detect the optimal configuration parameters. The result of this processis illustrated in Figure 4, where the top and bottom plots correspond tothe coarse search and the fine-grain refinement respectively. Performance ismeasured hereafter in terms of GFLOPS.The optimal configurations were detected at m c = 152 , k c = 952 for theCortex-A15 core and m c = 80 , k c = 352 for the Cortex-A7 core. As could beexpected, the optimal values for the Cortex-A15 core are larger than those ofthe Cortex-A7 core, since the L2 cache of the former is four times bigger. Forboth types of cores, the corresponding dimensions and the associativy-degreeof the caches allow that the micro-panel B r ( k c × n r ) fits into the L1 cachewhile the macro-panel A c ( m c × k c ) resides into the L2 cache.10 .4. Multi-threaded BLIS performance on the big and LITTLE clusters After determining the optimal configuration parameters for each corecache organization, we analyze the performance and energy efficiency of amulti-threaded implementation of BLIS gemm that operates in a homo-geneous (symmetric) system consisting of up to four cores from either theCortex-A15 cluster or the Cortex-A7 cluster. In particular, given the guide-lines summarized in Section 3.1, and the fact that the L2 cache is sharedamong the cores of the same cluster, we adopt a static parallelization ofLoop 4 using 1–4 threads/cores. Similar qualitative conclusions were ob-tained from a static parallelization of Loop 5. We note that, although thetwo types of clusters are evaluated in isolation in this section, the perfor-mance GFLOPS figures will be of interest for the asymmetric-aware versionsof gemm that will be presented in Sections 4 and 5, as their aggregationcan be considered as an ideal scenario for the peak performance that can beextracted from the complete asymmetric SoC.The plots in Figure 5 show the performance and energy efficiency of themulti-threaded gemm implementation in BLIS when using the Cortex-A15and the Cortex-A7 clusters in isolation. We note that, when calculatingthe energy efficiency of one type of cluster, the energy consumed by thecomplementary (idle) cluster is also accounted for, so that we are reportingthe energy efficiency of the complete SoC.Focusing on performance first, the results expose that the Cortex-A15cores deliver considerable higher performance than their Cortex-A7 coun-terparts. Specifically, the former type of cores renders an increase of 2 . . . . G F L O PS Problem dimension (r)GEMM GFLOPSA15 (4 threads)A15 (3 threads)A15 (2 threads)A15 (1 threads) A7 (4 threads)A7 (3 threads)A7 (2 threads)A7 (1 threads)(cid:10) G F L O PS / W Problem dimension (r)GEMM GFLOPS/WA15 (4 threads)A15 (3 threads)A15 (2 threads)A15 (1 threads) A7 (4 threads)A7 (3 threads)A7 (2 threads)A7 (1 threads)(cid:10)
3.4. Multi-threaded BLIS performance on the big and LITTLE clusters

After determining the optimal configuration parameters for each core cache organization, we analyze the performance and energy efficiency of a multi-threaded implementation of BLIS gemm that operates in a homogeneous (symmetric) system consisting of up to four cores from either the Cortex-A15 cluster or the Cortex-A7 cluster. In particular, given the guidelines summarized in Section 3.1, and the fact that the L2 cache is shared among the cores of the same cluster, we adopt a static parallelization of Loop 4 using 1–4 threads/cores. Similar qualitative conclusions were obtained from a static parallelization of Loop 5. We note that, although the two types of clusters are evaluated in isolation in this section, their GFLOPS figures will be of interest for the asymmetry-aware versions of gemm presented in Sections 4 and 5, as their aggregation can be considered as an ideal scenario for the peak performance that can be extracted from the complete asymmetric SoC.

The plots in Figure 5 show the performance and energy efficiency of the multi-threaded gemm implementation in BLIS when using the Cortex-A15 and the Cortex-A7 clusters in isolation. We note that, when calculating the energy efficiency of one type of cluster, the energy consumed by the complementary (idle) cluster is also accounted for, so that we are reporting the energy efficiency of the complete SoC.

Figure 5: Performance (left) and energy efficiency (right) of the BLIS gemm using exclusively one type of core, for a varying number of threads.

Focusing on performance first, the results expose that the Cortex-A15 cores deliver considerably higher performance than their Cortex-A7 counterparts. Specifically, the former type of cores renders an increase of 2... of the complete cluster. It is also worth emphasizing that the exploitation of four Cortex-A7 cores delivers significantly higher energy efficiency than an alternative that leverages a single Cortex-A15 core, though the overall performance of the former option is slightly worse.

In general, these graphs reveal that the performance achieved by the complete Cortex-A15 cluster is roughly four times that of the Cortex-A7 cluster, but their energy efficiency is similar. This last observation is interesting since, a priori, one could expect the Cortex-A7 cluster to be more energy-efficient than the Cortex-A15 cluster. However, we would like to remark that all our experiments report the energy efficiency of the complete SoC, and that the Cortex-A15 cluster in idle state already dissipates more power than a single Cortex-A7 core in execution.
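For reference, both metrics reported throughout the paper can be derived from the problem size, the measured execution time, and the average power drawn by the SoC: gemm performs 2mnk floating-point operations (2r³ for our square problems), and GFLOPS/W simply divides the resulting GFLOPS rate by the average power. The sketch below shows the computation; the time and power values passed in the example are placeholders, not measured data.

  #include <stdio.h>

  /* GFLOPS and GFLOPS/W for a square gemm of order r, given the elapsed time
   * (in seconds) and the average power drawn by the SoC (in Watts). */
  static void report(int r, double seconds, double avg_power_watts) {
    double flops  = 2.0 * r * r * r;        /* 2*m*n*k with m = n = k = r */
    double gflops = flops / seconds / 1e9;
    printf("r = %d: %.2f GFLOPS, %.2f GFLOPS/W\n",
           r, gflops, gflops / avg_power_watts);
  }

  int main(void) {
    report(2048, 3.5, 4.0);   /* placeholder time and power, for illustration only */
    return 0;
  }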
4. Architecture-Oblivious BLIS gemm on the big.LITTLE SoC
The default approach adopted by BLIS to map gemm on a multi-threaded CPU (see Section 3.1) presents two main drawbacks when applied to simultaneously leverage the asymmetric cores of an AMP:

• BLIS relies on a static partitioning and mapping of the loop iteration space among the threads, oblivious of the computational power of the cores these iteration chunks are assigned to. Therefore, independently of the chunk size and the specific loops that are parallelized, this strategy can only yield an unbalanced distribution of the workload (basically, the micro-kernels) among the asymmetric cores.
• In addition, BLIS employs constant values for the loop strides that, in order to attain high performance, need to match the optimal configuration parameters determined by the core cache organization. However, given that we face a system with two different architectures (Cortex-A15 and Cortex-A7), and thus different optimal cache parameters, we would ideally need to use different loop strides/configuration parameters for each type of core.

The following experiment is designed to expose the negative impact of these two mismatches between the BLIS approach and the Exynos 5422 SoC on the performance and energy behavior of gemm. For the evaluation, given the guidelines in Section 3.1 and the lack of an L3 cache in this chip, we adopt the following two-level parallelization strategy:

• Coarse-grain (or inter-cluster): Loop 1 is tackled using two threads to statically distribute its iteration space between the two clusters. This loop (and also Loop 3) is a good candidate for parallelization across cores with a proprietary and isolated L2 cache, as is the case of each cluster in the Exynos 5422 SoC.

• Fine-grain (or intra-cluster): Loop 4 is parallelized using up to four threads to statically distribute its iteration space among the four cores of the same cluster. This loop (as well as Loop 5) is a good candidate for parallelization across cores sharing a common L2 cache, as is the case of cores in the same cluster of the Exynos 5422 SoC.

In addition, the cache configuration parameters are set to those that are optimal for the Cortex-A15. We note that similar qualitative observations were obtained when parallelizing the alternative three combinations of Loops 1/3 and 4/5, and/or when the cache parameters were configured using the optimal values for the Cortex-A7.

Figure 6 illustrates the implications of the previous scheduling strategy in terms of loop partitioning and assignment to threads. In total, eight threads are created and bound to the cores, so that we are extracting 8-way parallelism overall within BLIS. Note how the iteration space for all loops is homogeneously distributed across the cores (i.e., without taking into account the core type).
Figure 6: Partitioning of the iteration space and assignment to threads/cores for a multi-threaded BLIS implementation with 8-way parallelism that combines 2-way parallelism from Loop 1 and 4-way parallelism from Loop 4. Threads in green and red are respectively mapped to big and LITTLE cores.
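The homogeneous partitioning depicted in Figure 6 can be summarized by the following sketch (the function and its rounding policy are illustrative, not the actual BLIS internals): the iteration range of the parallelized loop is divided into equal contiguous blocks, one per thread, no matter whether the thread is bound to a big or a LITTLE core.

  /* Homogeneous (symmetric) static partitioning: the iteration range [0, n)
   * of the parallelized loop is split into (nearly) equal contiguous blocks,
   * one per thread, irrespective of the speed of the core the thread runs on.
   * Thread "tid" receives the half-open range [*lo, *hi). */
  static void sss_partition(int n, int n_threads, int tid, int *lo, int *hi) {
    int base = n / n_threads;
    int rem  = n % n_threads;     /* the first "rem" threads get one extra iteration */
    *lo = tid * base + (tid < rem ? tid : rem);
    *hi = *lo + base + (tid < rem ? 1 : 0);
  }

Applied to Loop 1 with two clusters, each cluster receives half of the n columns, exactly as shown in Figure 6, even though the Cortex-A7 cluster processes its half much more slowly; this is the imbalance quantified next.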
Figure 7 reports the performance and energy efficiency of the two-level symmetric-static scheduling (sss) that parallelizes Loops 1 and 4. For reference, we also include the results from the parallelization of Loop 4 that separately exploits either the four cores in the Cortex-A15 cluster or the four cores in the Cortex-A7 cluster (see Section 3). The "Ideal" line in the performance graph corresponds to the aggregated performance of the configurations that use four cores of each of the two types in isolation (i.e., the performance of the four Cortex-A15 cores plus the performance of the four Cortex-A7 cores). This is a theoretical upper bound for the performance that can be attained when using an optimal scheduling strategy that exploits the asymmetry of the architecture.

This experiment reveals that a naive symmetric-static workload distribution, which considers neither the different computational power nor the different cache hierarchies of the Cortex-A15 and the Cortex-A7 clusters, exploits the full system (8 cores) to deliver only about 40% of the highest performance that is observed when employing only the four Cortex-A15 cores. The reason is that, with this approach, BLIS performs a static partitioning and mapping of the iteration space to the processing cores in a homogeneous manner.
Figure 7: Performance (left) and energy efficiency (right) of the reference BLIS gemm using exclusively one type of core in isolation, and the sss version with a coarse-grain parallelization of Loop 1 and a fine-grain parallelization of Loop 4 using 4 threads per cluster.

This causes a severe workload imbalance, as the threads running on the Cortex-A15 cores rapidly process their chunks, but then have to wait for the threads running on the slow Cortex-A7 cores to complete their work. The energy efficiency of the naive solution is also dramatically affected, and this configuration achieves the worst energy results. In conclusion, this experiment naturally motivates the need for an efficient alternative to the homogeneous sss partitioning of the iteration space integrated in the original multi-threaded implementation of BLIS gemm.
5. Architecture-Aware Optimization of BLIS gemm on the big.LITTLE SoC
In this section, we briefly review the control mechanism that governs the parallelization of BLIS gemm. Next, we propose and integrate two asymmetry-aware strategies for workload scheduling of the BLIS gemm micro-kernels, as well as a cache-aware configuration for AMPs, and we evaluate the impact of these techniques on performance and energy efficiency. The optimized implementations can be described, at a high level, as follows:

• Static-asymmetric scheduling (sas). This option statically partitions and assigns loop iterations to different thread types based on the performance difference between fast and slow cores.

• Cache-aware static-asymmetric scheduling (ca-sas). This strategy enhances sas by adapting the loop strides to the distinct cache configurations of the two computing clusters.

• Cache-aware dynamic-asymmetric scheduling (ca-das). This option improves the previous ones by replacing the static partitioning of the iteration space with a dynamic workload distribution across clusters.

5.1. Control trees in BLIS

The execution of all BLIS routines, including gemm, is commanded by a control-tree. This is a recursive data structure that encodes all the information necessary to combine the basic building blocks offered by the BLIS framework in order to implement high-performance algorithms for virtually any BLAS-like operation. The control tree for a given BLAS-3 operation governs, among others, which combination of loops is to be executed to complete the operation (that is, the exact algorithmic variant to execute at each level of the general algorithm), the loop stride for each loop (specific to each target architecture), and the exact points at which packing must occur. In addition, for multi-threaded BLIS implementations, the control tree defines which loops need to be parallelized and the level of concurrency to extract at each point of the algorithm.

A key property of the control trees is that they can be leveraged to modify these parameters without affecting the rest of the BLAS implementation, boosting the programmer's productivity and enhancing flexibility. In our modifications of the BLIS framework, we have exploited this abstraction mechanism in order to encode the differences between the original framework and our versions adapted for AMPs. In particular, we next focus on the necessary modifications and requirements to implement an asymmetric scheduling of the loop iteration space to fast and slow cores, and the modification of the loop strides in order to develop a cache-aware configuration for BLIS gemm.
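The role of the control tree can be visualized with the simplified C structure below; the type and field names are ours, chosen for illustration, and do not correspond to the actual BLIS definitions.

  /* Illustrative stand-in for a BLIS-like control-tree node. Each node
   * describes one loop of the gemm algorithm of Figure 1: its blocking
   * stride, whether a packing step occurs at this level, and how many
   * threads cooperate on it. */
  typedef struct cntl_node {
    int loop_id;              /* 1..5, following Figure 1              */
    int blocksize;            /* loop stride: nc, kc, mc, nr or mr     */
    int pack_buffer;          /* 0 = none, 1 = pack Ac, 2 = pack Bc    */
    int n_threads;            /* degree of parallelism extracted here  */
    struct cntl_node *child;  /* next (inner) loop                     */
  } cntl_node;

In the asymmetry-aware versions introduced next, one such tree can be instantiated per core type, so that "fast" and "slow" threads execute the same algorithm with different strides and degrees of parallelism.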
5.2. Static-asymmetric scheduling (sas)

Taking into account the experiment in Section 4, we have revamped the original multi-threaded implementation of BLIS gemm to distinguish between the distinct computational power of the two types of cores included in the ARM big.LITTLE architecture. In particular, the sas version of BLIS gemm integrates the following two new features, which modify the behavior of the default asymmetry-oblivious multi-threaded implementation at execution time: i) a mechanism to create "fast" and "slow" threads, which are bound upon initialization of the library to the big and LITTLE cores, respectively; and ii) a mechanism to decide which one of the parallelized loops needs to be partitioned and assigned to fast/slow cores asymmetrically. The number of iteration chunks assigned to the threads will thus no longer be the same. Instead, these numbers will be assigned according to the capabilities of each type of core.

Our reimplementation also comprises, as a configuration knob, an interface to specify the ratio of performance between big and LITTLE cores. For the specific loop that is selected as candidate to partition the computational workload between the two clusters, this configuration parameter controls the number of iteration chunks that are assigned to each cluster. The number of threads/cores of each type, the performance ratio, and the specific loop to be asymmetrically partitioned can thus be modified via ad-hoc environment variables, and they can all be fixed at execution time in order to tune the behavior of the library to other specific big.LITTLE setups (for example, to changes in the core frequency that affect the performance ratio between core types). This new functionality is fully configurable and has been embedded into the internal control tree structures that govern the parallelization of each loop in the general BLIS gemm algorithm.

5.2.1. Parallelization options for sas

Given the memory organization of the Exynos 5422 SoC, and the guidelines given for the parallelization of BLIS gemm in Section 3, we evaluated the following parallelization options for sas:

• Coarse-grain: the micro-kernels of the complete multiplication are distributed among the Cortex-A15 and Cortex-A7 clusters by parallelizing either Loop 1 or Loop 3, with a different number of iterations of the parallelized loop assigned to each cluster.

• Fine-grain: the execution of each macro-kernel is partitioned among the cores of the same type by parallelizing Loop 4, Loop 5, or both.

To illustrate this, Figure 8 depicts the distribution of the iteration space across fast and slow threads for a scenario in which the iteration space of Loop 1 is asymmetrically distributed using a ratio of 3, so that the fast threads are assigned three times more computations than the slow threads. Internally, Loop 4 is parallelized to distribute the work among the cores of the same cluster.
Figure 8: Partitioning of the iteration space and assignment to threads/cores for a multi-threaded BLIS implementation with 8-way parallelism that asymmetrically combines 2-way parallelism from Loop 1 (using a ratio between fast and slow cores of 3) and 4-way parallelism from Loop 4.
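A sketch of the asymmetric inter-cluster split of Loop 1 follows; the ratio is interpreted as a big:LITTLE proportion, and the boundary is rounded here to a multiple of nr so that whole micro-panels stay within one cluster. The function and its rounding policy are illustrative, not the exact code embedded in our BLIS modification.

  /* Asymmetric static split of the n columns traversed by Loop 1 between the
   * big and LITTLE clusters: for a big:LITTLE performance ratio "ratio", the
   * big cluster receives ratio/(ratio+1) of the columns. */
  static void sas_split_loop1(int n, int nr, int ratio, int *n_big, int *n_little) {
    int big_cols = (int)((long)n * ratio / (ratio + 1));
    big_cols -= big_cols % nr;       /* keep whole nr-wide micro-panels together */
    *n_big    = big_cols;
    *n_little = n - big_cols;
  }

With a ratio of 3, the big cluster is assigned three times as many columns as the LITTLE cluster, matching the distribution depicted in Figure 8; each cluster then executes the remaining loops over its own column range, with Loop 4 splitting that work among its four cores.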
5.2.2. Evaluation of sas

The combination of the coarse-grain and fine-grain parallelization strategies for sas yields four direct parallelization schemes. Additionally, two more configurations are possible, combining the coarse-grain parallelization of either Loop 1 or Loop 3 with the fine-grain parallelization of both Loops 4 and 5. For brevity, because the qualitative conclusions that can be extracted from these parallelization strategies are very similar, we only report results when the iteration space is distributed between the clusters in Loop 1 and the macro-kernel is partitioned among homogeneous cores in Loop 4, using (distribution) ratios for the coarse-grain parallelization that range from 1 to 7.
Figure 9: Performance (left) and energy efficiency (right) of the sas version of BLIS gemm with a coarse-grain parallelization of Loop 1 and a fine-grain parallelization of Loop 4 using 4 threads per cluster.
Figure 9 displays the results for this experiment. The performance results show that, when the appropriate workload distribution is applied, the asymmetry-aware sas outperforms the peak performance of all other configurations, being close to that of the ideal case. In particular, the left-hand side graph reveals that the worst performance is achieved when the ratio is 1 (i.e., a homogeneous inter-cluster parallelization). Also, the performance grows until a ratio of 5–6 is used and, above 6, in general declines, with a lower limit at the performance line delivered by the Cortex-A15 cluster in isolation (not included in the figure for clarity). These results indicate that ratios below 5 map too much workload to the Cortex-A7 cluster, and ratios above 6 assign an excessive workload to the Cortex-A15 cluster.

For the largest tested problem, the increment of performance for sas compared with the configuration that employs four Cortex-A15 cores only is close to 20%. However, sas offers lower performance for small problems, as the chunks assigned to the big and LITTLE cores are, in those cases, too small to exploit the asymmetric architecture.

In terms of energy efficiency, when the appropriate workload distribution is in place, sas delivers the same flops per Joule as the setup that exclusively employs the Cortex-A15 cluster. On the other hand, when the workload is unbalanced, the energy performance is greatly affected, as the fast threads remain idle but active, polling and consuming energy, until the slow threads complete their work.

5.3. Cache-aware static-asymmetric scheduling (ca-sas)

The original implementation of BLIS contains a single control-tree per operation, which implies that the gemm routine can only employ the optimal cache configuration parameters for either the Cortex-A15 or the Cortex-A7. Our solution to this problem duplicates the control structure to set different configuration values for mc and kc, depending on the type of core. Specifically, two different control-trees are created upon initialization, for "fast" and "slow" threads, each setting the optimal loop strides/cache parameters for a different core architecture (see Section 3). In addition, this mechanism opens the door to the use of specific highly-tuned micro-kernels adapted to each micro-architecture in the AMP (and, therefore, optimal values for mr and nr), depending on the type of core that executes them. We note that, as argued earlier in Section 3, the performance of gemm is quite independent of nc, since there is no L3 cache in the Exynos 5422 SoC. Furthermore, we leverage the same micro-kernel for both the Cortex-A7 and Cortex-A15 clusters since, in this particular SoC, it is optimal for both.

An important caveat of this approach is that there may be some dependencies between the optimal configurations used for the clusters. Concretely, if the micro-kernels are distributed among the Cortex-A15 and Cortex-A7 clusters by parallelizing Loop 1, independent buffers are used for Ac and Bc, and no dependencies arise.
However, if they are partitioned between the clusters by parallelizing Loop 3, then the buffer for Bc is shared, and it is necessary to employ a common value of kc for the Cortex-A15 and the Cortex-A7. In this scenario the parameter is set to kc = 952 in both control-trees, and a new (sub)optimal value for mc has to be obtained for the Cortex-A7 threads. In order to do so, we carried out a search similar to that exposed in Section 3, finding the new optimal value at mc = 32 for the Cortex-A7 (taking into account that the kc parameter depends on the Cortex-A15). With these new optimal parameters, the performance peak attained with the Cortex-A7 cluster is slightly worse than that observed with the actual Cortex-A7-specific optimal parameters. However, it is still higher than that obtained with the cache parameters for the Cortex-A15 as, with those much larger values, the memory buffer Ac does not fit into the small L2 cache of the Cortex-A7.
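The need for the smaller value can be checked with the same footprint arithmetic as in Section 3 (8-byte elements, capacity only): once kc = 952 is inherited from the Cortex-A15 control-tree, the Cortex-A15 value mc = 152 would produce a macro-panel Ac that exceeds the 512-Kbyte L2 cache of the Cortex-A7 cluster, whereas mc = 32 keeps it well inside.

  #include <stdio.h>

  int main(void) {
    /* Macro-panel footprint Ac = mc * kc * 8 bytes on the Cortex-A7 cluster
     * (512-Kbyte shared L2 cache) once kc = 952 is imposed by the Cortex-A15. */
    printf("mc = 152: Ac = %ld bytes\n", 152L * 952 * 8);  /* 1,157,632 bytes (~1.1 MB): does not fit */
    printf("mc =  32: Ac = %ld bytes\n",  32L * 952 * 8);  /*   243,712 bytes (~238 KB): fits         */
    return 0;
  }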
5.3.1. Evaluation of sas and ca-sas

The combination of the coarse-grain and fine-grain parallelization strategies described in Section 5.2.1 yields the same parallelization options for ca-sas. For the same reasons, we only report next the results corresponding to a scenario where the iteration space is distributed between the clusters across Loop 1, while the macro-kernel is partitioned within clusters in Loop 4, using (distribution) ratios for the inter-cluster parallelization of 1, 3 and 5. For each distribution ratio, we include two lines, corresponding to the use of two control-trees (ca-sas) and only one (sas).

Figure 10: Performance (left) and energy efficiency (right) of the sas and ca-sas versions of BLIS gemm with a coarse-grain parallelization of Loop 1 and a fine-grain parallelization of Loop 4 using 4 threads per cluster.

The plots in Figure 10 illustrate that, for both metrics of interest, better results are obtained with the option that integrates two control-trees. The increases in performance and energy efficiency are a direct consequence of the accelerated execution of the workload assigned to the Cortex-A7 cluster, derived from the use of more convenient cache configuration parameters. We notice that the improvements at this point are only visible when too much work is assigned to the Cortex-A7 cluster (i.e., for distribution ratios below 5). However, as we will expose later, this strategy has a more visible impact when a dynamic workload distribution is adopted.

To conclude the evaluation of the ca-sas implementation of BLIS, we compare the four direct combinations (parallelization options) of the coarse-grain (Loop 1 or Loop 3) and fine-grain (Loop 4 or Loop 5) options, for a concrete distribution ratio of 5, using two control-trees. Figure 11 reports the outcome of this evaluation. The plots show that the fine-grain parallelization of Loop 4 yields performance curves closer to that of the ideal case than the alternatives that parallelize Loop 5. The reason is that nc (linked to Loop 4) is usually much larger than mc (linked to Loop 5) and, therefore, it is easier to attain a more balanced workload distribution with this option.
Figure 11: Performance (left) and energy efficiency (right) of the ca-sas version of BLIS gemm with a coarse-grain parallelization of either Loop 1 or Loop 3, combined with a fine-grain parallelization of either Loop 4 or Loop 5, using a ratio of 5 in both cases and 4 threads per cluster.

Although it is not possible to leverage the actual optimal cache parameters specific to the Cortex-A7 cluster when Loop 3 is parallelized, the plots also reveal that, when the fine-grain parallelization is set in Loop 4, there is no noticeable difference between distributing the computational workload in either Loop 1 or Loop 3; however, the difference is present when the fine-grain parallelization is set in Loop 5.

5.4. Cache-aware dynamic-asymmetric scheduling (ca-das)

Our final step towards attaining a high-performance implementation of BLIS gemm for an AMP SoC integrates a mechanism that dynamically distributes the workload between the two SoC clusters. The main advantage of this option is that a predefined distribution ratio becomes unnecessary, though the target loop this feature is applied to still needs to be chosen with care.

The candidates to apply a dynamic distribution are, obviously, Loop 1 and Loop 3, since these have been previously identified as the best options to distribute the computational workload between the two clusters. However, the cache parameter nc (linked to the stride of Loop 1) is often in the order of several hundreds up to a few thousands and, therefore, in practice it is too large to dynamically distribute the iteration space. In contrast, the cache parameter mc (linked to the stride of Loop 3) is usually in the order of a few hundreds, and thus it is a good candidate to dynamically distribute the iterations.
Diving into details, nc = 4,096 for both types of cores, while mc = 32 and 152 for the Cortex-A7 and Cortex-A15 cores, respectively. In consequence, the coarse-grain dynamic distribution of the workload will be carried out across Loop 3, with two independent control-trees in place, bound to "fast" and "slow" threads. Note that, like in the ca-sas scheduling strategy, the buffer Bc is shared by both clusters and, in consequence, kc is set to 952 for both types of cores (cache-aware optimization).

The application of a dynamic scheduling strategy removes the static partitioning carried out before Loop 3. This is replaced by a mechanism where, at each iteration of Loop 3, a single thread bound to a "fast" core and a single thread bound to a "slow" core select the current chunk size, which depends on the configuration parameter mc of each type of core. The selected workload is broadcast to the remaining threads of the same type. The fine-grain parallelization remains unmodified and targets Loop 4, Loop 5, or both. The chunk size selection is performed inside a critical section that controls the execution of Loop 3. The overhead of this synchronization point is fully amortized by the utilization of a more flexible workload distribution.

5.4.1. Evaluation of ca-das

This last round of experiments presents a more reduced number of options, since Loop 1 was identified as a poor choice to dynamically distribute the computational workload. We report results when the iteration space is dynamically distributed between clusters across Loop 3, and the macro-kernel is partitioned within clusters in Loop 4 or in Loop 5, using either two control-trees (one for "fast" and one for "slow" threads, ca-das) or a single control-tree for both types of threads (das). Additionally, for comparison purposes, we include the performance lines of the best ca-sas strategy with a distribution ratio of 5.

The plots in Figure 12 reveal that, for both metrics of interest, the best results are attained when the coarse-grain parallelization is dynamically applied to Loop 3 and the fine-grain parallelization is done at Loop 4. If the fine-grain parallelization is set across Loop 5, the results are inferior to those reported for the static approach, since the amount of concurrency that can be extracted is lower for Loop 5 than for Loop 4 (see Figure 11 and the corresponding analysis for details). On the other hand, the plots show that the use of two control-trees has a great impact on both metrics. The use of a common control-tree implies that the chunk size assigned to both types of threads is the same. Therefore, due to the difference in performance of the Cortex-A7 and Cortex-A15 clusters, this produces a severe load imbalance for certain problem sizes.
Figure 12: Performance (left) and energy efficiency (right) of the ca-das and das versions of BLIS gemm with a coarse-grain parallelization of Loop 3 and a fine-grain parallelization of either Loop 4 or Loop 5, using 4 threads per cluster in both cases.
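To close the section, the sketch below illustrates the kind of dynamic chunk selection used by ca-das, written with OpenMP for brevity: a shared counter over the m dimension of Loop 3 is advanced inside a critical section, and the representative thread of each cluster requests chunks of its own cache-aware size (mc = 152 for the big cores, mc = 32 for the LITTLE ones). This is a simplified stand-in for our modification of BLIS, not the actual implementation.

  #include <omp.h>

  /* Shared counter over the i-dimension (0 <= ic < m) traversed by Loop 3. */
  static int next_ic = 0;

  /* Returns the start of the next chunk, or -1 when the iteration space is
   * exhausted; "*mc_taken" receives the chunk size actually assigned, which
   * is the cache-aware mc of the calling cluster (152 or 32). */
  static int grab_chunk(int m, int mc_cluster, int *mc_taken) {
    int ic = -1;
    #pragma omp critical(loop3_counter)
    {
      if (next_ic < m) {
        ic = next_ic;
        *mc_taken = (m - ic < mc_cluster) ? (m - ic) : mc_cluster;
        next_ic += *mc_taken;
      }
    }
    return ic;
  }

A single thread per cluster would call grab_chunk with the mc of its core type and broadcast the resulting [ic, ic + mc_taken) range to its three peers before they jointly pack Ac and execute the macro-kernel, mirroring the broadcast step described above.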
6. Conclusions
We have proposed and evaluated several mechanisms to efficiently map the framework for matrix multiplication integrated in the BLIS library to an asymmetric ARM big.LITTLE (Cortex-A15+A7) SoC. Our results reveal excellent improvements in performance compared with a homogeneous implementation that operates exclusively on one type of core (either A15 or A7), and also with respect to multi-threaded implementations that simply apply a symmetric workload distribution and do not take into account the different cache organization of the cores.

This is an important step towards a full BLAS implementation optimized for big.LITTLE architectures, which is a future goal in our research effort. While we believe that the approach applied to gemm carries over to the rest of the BLAS, there are a number of issues that need to be addressed to further increase performance and adaptation to other (present and future) asymmetric architectures. Among others, the most relevant factor is the adoption of different micro-kernels, tuned to each type of core, in order to extract the maximum performance for those asymmetric architectures. A port to a 64-bit ARMv8 architecture, and an experimental study on architectures with different numbers of big/LITTLE cores, are also key milestones in our roadmap.
Acknowledgments
The researchers from Universitat Jaume I were supported by project CICYT TIN2011-23283 of MINECO and FEDER, the EU project FP7 318793 "EXA2GREEN", and the FPU program of MECD. The researcher from Universidad Complutense de Madrid was supported by project CICYT TIN2012-32180.
References

[1] R. Dennard, F. Gaensslen, V. Rideout, E. Bassous, A. LeBlanc, Design of ion-implanted MOSFET's with very small physical dimensions, IEEE Journal of Solid-State Circuits 9 (5) (1974) 256–268.

[2] G. Moore, Cramming more components onto integrated circuits, Electronics 38 (8) (1965) 114–117.

[3] M. Duranton et al., The HiPEAC vision for advanced computing in horizon 2020 (2013).

[4] J. F. Lavignon et al., ETP4HPC strategic research agenda achieving HPC leadership in Europe.

[5] R. Lucas et al., Top ten Exascale research challenges, http://science.energy.gov/~/media/ascr/ascac/pdf/meetings/20140210/Top10reportFEB14.pdf (2014).

[6] H. Esmaeilzadeh, E. Blem, R. St. Amant, K. Sankaralingam, D. Burger, Dark silicon and the end of multicore scaling, in: Proc. 38th Annual Int. Symp. on Computer Architecture, ISCA'11, 2011, pp. 365–376.

[7] The TOP500 list (2015).

[8] The GREEN500 list (2015).

[16] Intel, Math Kernel Library (MKL), http://software.intel.com/en-us/intel-mkl (2015).

[17] AMD, AMD Core Math Library, http://developer.amd.com/tools/cpu/acml/pages/default.aspx (2015).

[18] IBM, Engineering and Scientific Subroutine Library (2015).

[19] NVIDIA, CUDA basic linear algebra subprograms, https://developer.nvidia.com/cuBLAS (2015).

[20] K. Goto, R. van de Geijn, Anatomy of a high-performance matrix multiplication, ACM Trans. Math. Softw. 34 (3) (2008) 12:1–12:25.

[21] K. Goto, R. van de Geijn, High performance implementation of the level-3 BLAS, ACM Trans. Math. Softw. 35 (1) (2008) 4:1–4:14. URL http://doi.acm.org/10.1145/1377603.1377607

[22] OpenBLAS, http://xianyi.github.com/OpenBLAS/ (2015).

[23] F. G. Van Zee, R. A. van de Geijn, BLIS: A framework for generating BLAS-like libraries, ACM Trans. Math. Softw. To appear.

[24] R. C. Whaley, J. J. Dongarra, Automatically tuned linear algebra software, in: Proceedings of SC'98, 1998.

[25] N. Chitlur, G. Srinivasa, S. Hahn, P. Gupta, D. Reddy, D. Koufaty, P. Brett, A. Prabhakaran, L. Zhao, N. Ijih, S. Subhaschandra, S. Grover, X. Jiang, R. Iyer, QuickIA: Exploring heterogeneous architectures on real prototypes, in: High Performance Computer Architecture (HPCA), 2012 IEEE 18th International Symposium on, 2012, pp. 1–8. doi:10.1109/HPCA.2012.6169046.

[26] N. Chitlur, G. Srinivasa, S. Hahn, P. K. Gupta, D. Reddy, D. Koufaty, P. Brett, A. Prabhakaran, L. Zhao, N. Ijih, S. Subhaschandra, S. Grover, X. Jiang, R. Iyer, QuickIA: Exploring heterogeneous architectures on real prototypes, in: Proc. IEEE 18th Int. Symp. on High-Performance Computer Architecture, HPCA'12, 2012, pp. 1–8.

[27] J. Hourd, C. Fan, J. Zeng, Q. S. Zhang, M. J. Best, A. Fedorova, C. Mustard, Exploring practical benefits of asymmetric multicore processors, in: 2nd Workshop on Parallel Execution of Sequential Programs on Multi-core Architectures, PESPMA 2009.

[28] N. B. Lakshminarayana, J. Lee, H. Kim, Age based scheduling for asymmetric multiprocessors, in: Proc. Conference on High Performance Computing Networking, Storage and Analysis, SC'09, 2009, pp. 25:1–25:12.

[29] R. Rodrigues, A. Annamalai, I. Koren, S. Kundu, Improving performance per watt of asymmetric multi-core processors via online program phase classification and adaptive core morphing, ACM Trans. Des. Autom. Electron. Syst. 18 (1) (2013) 5:1–5:23.

[30] D. Clarke, A. Lastovetsky, V. Rychkov, Column-based matrix partitioning for parallel matrix multiplication on heterogeneous processors based on functional performance models, in: Euro-Par 2011: Parallel Processing Workshops, Vol. 7155 of LNCS, 2012, pp. 450–459.

[31] O. Beaumont, L. Marchal, Analysis of dynamic scheduling strategies for matrix multiplication on heterogeneous platforms, in: Proc. 23rd Int. Symp. on High-Performance Parallel and Distributed Computing, HPDC'14, 2014, pp. 141–152.

[32] T. M. Low, F. D. Igual, T. M. Smith, E. S. Quintana-Ortí, Analytical modeling is enough for high performance BLIS, ACM Trans. Math. Softw. In review.

[33] F. G. Van Zee, T. M. Smith, B. Marker, T. M. Low, R. A. van de Geijn, F. D. Igual, M. Smelyanskiy, X. Zhang, M. Kistler, V. Austel, J. Gunnels, L. Killough, The BLIS framework: Experiments in portability, ACM Trans. Math. Softw. In review.

[34] T. M. Smith, R. van de Geijn, M. Smelyanskiy, J. R. Hammond, F. G. Van Zee, Anatomy of high-performance many-threaded matrix multiplication, in: Proc. IEEE 28th Int. Parallel and Distributed Processing Symp., IPDPS'14, 2014, pp. 1049–1059.

[35] P. Alonso, R. M. Badia, J. Labarta, M. Barreda, M. F. Dolz, R. Mayo, E. S. Quintana-Ortí, R. Reyes, Tools for power-energy modelling and analysis of parallel scientific applications, in: 41st Int. Conf. on Parallel Processing – ICPP, 2012, pp. 420–429.

[36] T. M. Low, F. D. Igual, T. M. Smith, E. S. Quintana-Ortí, Analytical modeling is enough for high performance BLIS, Tech. Rep. FLAWN.