Blocked All-Pairs Shortest Paths Algorithm on Intel Xeon Phi KNL Processor: A Case Study
Enzo Rucci∗, Armando De Giusti, and Marcelo Naiouf
III-LIDI, CONICET, Facultad de Informática, Universidad Nacional de La Plata
III-LIDI, Facultad de Informática, Universidad Nacional de La Plata
November 6, 2018
∗ [email protected]

The final publication is available at Springer via https://doi.org/10.1007/978-3-319-75214-3_5
Abstract
Manycores are consolidating in the HPC community as a way of improving performance while keeping power efficiency. Knights Landing (KNL) is the recently released second generation of the Intel Xeon Phi architecture. While optimizing applications on CPUs, GPUs and first-generation Xeon Phis has been largely studied in recent years, the new features of Knights Landing processors require a revision of the programming and optimization techniques for these devices. In this work, we selected the Floyd-Warshall algorithm as a representative case study of graph and memory-bound applications. Starting from the default serial version, we show how data, thread and compiler level optimizations help the parallel implementation to reach 338 GFLOPS.
1 Introduction

The power consumption problem represents one of the major obstacles for Exascale system design. As a consequence, the scientific community is searching for different ways to improve the power efficiency of High Performance Computing (HPC) systems [8]. One recent trend to increase compute power and, at the same time, limit the power consumption of these systems lies in adding accelerators, like NVIDIA/AMD graphics processing units (GPUs), or Intel Many Integrated Core (MIC) co-processors. These manycore devices are capable of achieving better FLOPS/Watt ratios than traditional CPUs. For example, the number of Top500 [2] systems using accelerator technology grew from 54 in June 2013 to 91 in June 2017. In the same period, the number of systems based on accelerators increased from 55 to 90 on the Green500 list [1].

In this work, we select the Floyd-Warshall (FW) algorithm as a representative case study of graph and memory-bound applications. FW requires O(n³) operations and O(n²) memory space, where n is the number of vertices in a graph. Starting from the default serial version, we show how data, thread and compiler level optimizations help the parallel implementation to reach 338 GFLOPS.

The rest of the paper is organized as follows. Section 2 briefly introduces the Intel Xeon Phi KNL architecture, while Section 3 presents the FW algorithm. Section 4 describes our implementation. In Section 5 we analyze performance results, while Section 6 discusses related work. Finally, Section 7 outlines conclusions and future lines of work.

2 Intel Xeon Phi Knights Landing

KNL is the second generation of the Intel Xeon Phi family and the first capable of operating as a standalone processor. The KNL architecture is based on a set of Tiles (up to 36) interconnected by a 2D mesh.
Each Tile includes two cores based on the out-of-order Intel Atom microarchitecture (4 threads per core), two Vector Processing Units (VPUs) and a shared 1 MB L2 cache. These VPUs not only implement the new 512-bit AVX-512 ISA, but are also compatible with prior vector ISAs such as SSEx and AVXx. AVX-512 provides 512-bit SIMD support, 32 logical registers, 8 new mask registers for vector predication, and gather and scatter instructions to support loading and storing sparse data. As each AVX-512 instruction can perform 8 double-precision (DP) operations (8 FLOPS) or 16 single-precision (SP) operations (16 FLOPS), the peak performance is over 1.5 TFLOPS in DP and 3 TFLOPS in SP, more than two times higher than that of the KNC. It is also more energy efficient than its predecessor [17].

Another significant feature of the KNL architecture is the inclusion of an in-package high-bandwidth memory (HBM) called MCDRAM. This special memory offers three operating modes: Cache, Flat and Hybrid.
In Cache mode, the MCDRAM is used as an L3 cache, caching data from the DDR4 level of memory. Even though application code remains unchanged, the MCDRAM can suffer lower performance rates in this mode. In Flat mode, the MCDRAM has a physically addressable space, offering the highest bandwidth and lowest latency; however, software modifications may be required in order to use both the DDR and the MCDRAM in the same application. In Hybrid mode, the MCDRAM is divided into two parts: one part in Cache mode and one in
Flat mode [5].

From a software perspective, KNL supports the parallel programming models traditionally used on HPC systems, such as OpenMP or MPI. This fact represents a strength of this platform, since it simplifies code development and improves portability over alternatives based on accelerator-specific programming languages such as CUDA or OpenCL. However, to achieve high performance, programmers should attend to:

• the efficient exploitation of the memory hierarchy, especially when handling large datasets, and
• how to structure the computations to take advantage of the VPUs.

Automatic vectorization is obviously the easiest way to exploit the VPUs. However, in most cases the compiler is unable to generate SIMD binary code, because it cannot prove that loops are free of data dependencies. In that sense, SIMD instructions are supported on KNL processors through guided compilation or hand-tuned coding with intrinsic instructions [17]. On one hand, in guided vectorization the programmer indicates to the compiler (through the insertion of tags) which loops are independent and what their memory access patterns are. In this way, the compiler is able to generate SIMD binary code while preserving program portability. On the other hand, intrinsic vectorization usually involves rewriting most of the corresponding algorithm. The programmer gains control at the cost of losing portability. Moreover, this approach can also inhibit other compiler loop-level optimizations. Nevertheless, it is the only way to exploit parallelism in some applications with irregular access patterns or with loop data dependencies that can be hidden by recomputation techniques [6].

3 The Floyd-Warshall Algorithm

The FW algorithm uses a dynamic programming approach to compute the all-pairs shortest-paths problem on a directed graph [7, 20]. It takes as input an N × N distance matrix D, where D[i,j] is initialized with the original distance from node i to node j. FW runs for N iterations; at the k-th iteration, it evaluates all possible paths between each pair of vertices i and j through the intermediate vertex k. As a result, FW produces an updated matrix D, where D[i,j] now contains the shortest distance between nodes i and j. Additionally, a matrix P is generated when the reconstruction of the shortest path is required; P[i,j] contains the most recently added intermediate node between i and j. Figure 1 exhibits the naive FW algorithm.

[Figure 1: Naive Floyd-Warshall algorithm]
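As a concrete reference, the following is a minimal C sketch of the naive algorithm of Figure 1; the row-major single-precision layout and the identifier names are our assumptions, not necessarily those of the authors' code.

```c
#include <stddef.h>

/* Naive FW (cf. Figure 1): D is a row-major N x N single-precision
 * distance matrix; P records the last intermediate vertex of each path.
 * Layout and names are illustrative assumptions. */
void fw_naive(float *D, int *P, size_t N)
{
    for (size_t k = 0; k < N; k++)
        for (size_t i = 0; i < N; i++)
            for (size_t j = 0; j < N; j++) {
                float sum = D[i * N + k] + D[k * N + j];
                if (sum < D[i * N + j]) {
                    D[i * N + j] = sum;     /* shorter path through k     */
                    P[i * N + j] = (int)k;  /* remember intermediate node */
                }
            }
}
```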
4 Implementation

In this section, we address the optimizations performed on the Intel Xeon Phi KNL processor. First, we developed a serial implementation following the naive version described in Figure 1, which serves as our baseline. Next, we optimized this serial version considering data locality, data level parallelism and loop unrolling. Finally, we exploited thread level parallelism on top of the optimized serial version.

4.1 Data Locality

To improve data locality, the FW algorithm can be blocked [19]. Unfortunately, the three loops cannot be freely interchanged due to the data dependencies from one iteration to the next in the k-loop (only the i and j loops can be done in any order). However, under certain conditions, the k-loop can be put inside the i-loop and the j-loop, making blocking possible. The distance matrix D is partitioned into blocks of size BS × BS, so that there are (N/BS)² blocks. The computation involves R = N/BS rounds, and each round is divided into four phases based on the data dependencies among the blocks:

1. Update block D[k,k], because it is self-dependent.
2. Update the remaining blocks of the k-th row, because each of them depends on itself and the previously computed D[k,k].
3. Update the remaining blocks of the k-th column, because each of them depends on itself and the previously computed D[k,k].
4. Update the rest of the blocks, as each of them depends on the k-th block of its row and the k-th block of its column.

In this way, all data dependencies of the algorithm are satisfied. Figure 2 shows a schematic representation of a round computation and the data dependencies among the blocks, while Figure 3 presents the corresponding pseudo-code.

[Figure 2: Schematic representation of the blocked Floyd-Warshall algorithm]

4.2 Data Level Parallelism

The innermost loop of the FW_BLOCK code block from Figure 3 is clearly the most computationally expensive part of the algorithm and therefore the best candidate for vectorization. The loop body is composed of an if statement that involves one addition, one comparison and (possibly) two assignments. Unfortunately, the compiler detects false dependencies in that loop and is not able to generate SIMD binary code on its own. For that reason, we explored two SIMD exploitation approaches: (1) guided vectorization through the use of the simd directive, and (2) intrinsic vectorization employing the AVX-512 extensions. The guided approach simply consists of inserting the simd directive on the innermost loop of the FW_BLOCK code block (line 4). In contrast, the intrinsic approach consists of rewriting the entire loop body. Figures 4 and 5 show the pseudo-code of the FW_BLOCK implementation using guided and manual vectorization, respectively. In order to accelerate SIMD computation with 512-bit vectors, we carefully managed memory allocation so that the distance and path matrices are 64-byte aligned; in the guided approach, this also requires adding the aligned clause to the simd directive. Both approaches are sketched below.
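As a rough illustration of the guided approach (cf. Figure 4), the innermost loop of FW_BLOCK might look as follows. The identifiers (Di and Dk for the current rows of the distance blocks, dik for D[i][k], Pi for the path row, bs for the block size) are our own, and the exact pragma spelling may differ from the authors' code:

```c
/* Guided vectorization sketch: the simd directive asserts that the
 * iterations are independent, and the aligned clause declares the
 * 64-byte alignment mentioned in the text. */
static void fw_block_row_simd(float *restrict Di, const float *restrict Dk,
                              float dik, int *restrict Pi, int k, int bs)
{
    #pragma omp simd aligned(Di, Dk, Pi : 64)
    for (int j = 0; j < bs; j++) {
        float sum = dik + Dk[j];   /* D[i][k] + D[k][j] */
        if (sum < Di[j]) {
            Di[j] = sum;           /* shorter path found through k */
            Pi[j] = k;             /* record intermediate vertex   */
        }
    }
}
```

The intrinsic counterpart (cf. Figure 5) rewrites the same loop with AVX-512 operations, replacing the if statement with a comparison mask and masked stores. Again a sketch under our own naming, assuming bs is a multiple of 16 and that both matrices were allocated 64-byte aligned (e.g., with _mm_malloc(bytes, 64)):

```c
#include <immintrin.h>

/* Intrinsic vectorization sketch of the same loop (16 floats per step). */
static void fw_block_row_avx512(float *Di, const float *Dk,
                                float dik, int *Pi, int k, int bs)
{
    const __m512  vik = _mm512_set1_ps(dik);     /* broadcast D[i][k]      */
    const __m512i vk  = _mm512_set1_epi32(k);    /* intermediate vertex id */
    for (int j = 0; j < bs; j += 16) {
        __m512 dij = _mm512_load_ps(&Di[j]);     /* current D[i][j..j+15]  */
        __m512 sum = _mm512_add_ps(vik, _mm512_load_ps(&Dk[j]));
        __mmask16 lt = _mm512_cmp_ps_mask(sum, dij, _CMP_LT_OQ);
        _mm512_mask_store_ps(&Di[j], lt, sum);   /* update shorter paths   */
        _mm512_mask_store_epi32(&Pi[j], lt, vk); /* update path entries    */
    }
}
```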
4.3 Loop Unrolling

Loop unrolling is another optimization technique that helped us improve code performance. Fully unrolling the innermost loop of the FW_BLOCK code block was found to work well, as was unrolling the i-loop of the same code block once.

4.4 Thread Level Parallelism

To exploit parallelism across multiple cores, we implemented a multi-threaded version of the FW algorithm based on the OpenMP programming model. A parallel construct is inserted before the loop of line 13 in Figure 3 to create a parallel region. To respect the data dependencies among the block computations, the work-sharing constructs must be carefully inserted. At each round, phase 1 must be computed before the rest, so a single construct is inserted to enclose line 16. Next, phases 2 and 3 must be computed before phase 4. As these blocks are independent of each other, a for directive is inserted before the loops of lines 18 and 22; in addition, a nowait clause is added to the phase 2 loop to alleviate thread idling. Finally, another for construct is inserted before the loop of line 26 to distribute the remaining blocks among the threads. The resulting structure is sketched below.
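A possible skeleton of this scheme is shown below. FW_BLOCK stands for the block routine of Figure 3 (updating block (bi, bj) through the blocks in row/column bk); its signature, like the other names here, is a hypothetical reconstruction rather than the authors' actual code.

```c
void FW_BLOCK(float *D, int *P, int bi, int bj, int bk,
              int N, int BS);  /* block routine of Figure 3 (hypothetical) */

/* OpenMP skeleton of the blocked FW: the parallel region encloses the
 * round loop, and the work-sharing constructs enforce the phase order. */
void fw_parallel(float *D, int *P, int N, int BS)
{
    int R = N / BS;                        /* number of rounds */
    #pragma omp parallel
    for (int k = 0; k < R; k++) {
        #pragma omp single                 /* phase 1: self-dependent block */
        FW_BLOCK(D, P, k, k, k, N, BS);

        #pragma omp for nowait             /* phase 2: k-th row; nowait    */
        for (int j = 0; j < R; j++)        /* lets threads move on to      */
            if (j != k)                    /* phase 3 without idling       */
                FW_BLOCK(D, P, k, j, k, N, BS);

        #pragma omp for                    /* phase 3: k-th column; the    */
        for (int i = 0; i < R; i++)        /* implicit barrier here waits  */
            if (i != k)                    /* for phases 2 and 3           */
                FW_BLOCK(D, P, i, k, k, N, BS);

        #pragma omp for                    /* phase 4: remaining blocks    */
        for (int i = 0; i < R; i++)
            for (int j = 0; j < R; j++)
                if (i != k && j != k)
                    FW_BLOCK(D, P, i, j, k, N, BS);
    }
}
```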
5 Experimental Results

All tests were performed on an Intel server running CentOS 7.2, equipped with a 68-core Xeon Phi 7250 processor at 1.40 GHz (4 hardware threads per core and 16 GB of MCDRAM) and 48 GB of main memory. The processor was run in Flat memory mode and
Quadrant cluster mode.

We used Intel's ICC compiler (version 17.0.1.132) with the -O3 optimization level. To generate explicit AVX2 and AVX-512 instructions, we employed the -xAVX2 and -xMIC-AVX512 flags, respectively. We also used the numactl utility to exploit the MCDRAM memory, which requires no source code modification. Besides, different workloads were tested, varying N.
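For illustration, a typical run in Flat mode might look as follows. The binary name and arguments are hypothetical, and the NUMA node id of the MCDRAM (1 here, as is usual under Flat/Quadrant) should be checked with numactl -H:

```
# Bind all allocations to the MCDRAM NUMA node; no code changes needed.
numactl --membind=1 ./fw_knl 8192 64
```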
First, we evaluated the performance improvements of the different optimization techniques applied to the naive serial version: blocking (blocked), data level parallelism (simd, simd (AVX2) and simd (AVX-512)), aligned access (aligned) and loop unrolling (unrolled). Table 1 shows the execution time (in seconds) of the different serial versions when N = 4096. As can be observed, the blocking optimization reduces the execution time by 5%. Regarding the block size, 256 × 256 was found to work best. In the most memory-demanding case of each round (phase 4), four blocks are loaded into the cache (three distance blocks and one path block). These four blocks require 4 × 256 × 256 × 4 bytes = 1 MB, which matches the L2 cache size of a tile.

[Table 1: Execution time (in seconds) of the serial versions for N = 4096: naive, blocked, simd, simd (AVX2), simd (AVX-512), aligned, unrolled.]

Adding the simd construct to the blocked version reduced the execution time from 572.66 to 204.52 seconds, which represents a 2.8× speedup. However, AVX-512 instructions can perform 16 SP operations at a time. After inspecting the code at assembly level, we realized that the compiler generates SSEx instructions by default. As SSEx can perform 4 SP operations at a time, the 2.8× speedup makes more sense, since not all of the code can be vectorized. Next, we recompiled the code with the -xAVX2 and -xMIC-AVX512 flags to force the compiler to generate AVX2 and AVX-512 instructions, respectively. AVX2 instructions achieved a 6× speedup over the naive version, while AVX-512 instructions achieved a 15.5× speedup. It is thus clear that this application benefits from larger SIMD widths. Regarding the other optimization techniques employed, we found that the simd (AVX-512) implementation runs 1.11× faster when memory accesses in the AVX-512 computations are aligned (aligned). Additionally, applying loop unrolling to the aligned version led to higher performance, with a further 1.45× speedup. In summary, we achieved a 26.3× speedup over the naive serial version through the combination of the different optimizations described.

Taking the optimized serial version, we developed a multi-threaded implementation as described in Section 4.4. Figure 6 shows the performance (in terms of GFLOPS) for the different affinity types, varying the number of threads, when N = 8192. As expected, compact affinity produced the worst results, since it favours using all threads of a core before using other cores. Scatter and balanced affinities presented similar performance, improving on the none counterpart. As the KNL processor used in this study has all its cores in the same package, scatter and balanced affinities distribute the threads in the same manner when one thread per core is assigned. Regarding the number of threads, a single thread per core is enough to reach maximal performance (except with compact affinity). This behavior is the opposite of the KNC generation, where two or more threads per core were required to achieve high performance. However, it should not be a surprise, since the KNL cores were designed to optimize single-thread performance, including out-of-order pipelines and two VPUs per core.

It is important to remark that, unlike the optimized serial version, the parallel implementation used a smaller block size, since it delivered higher performance. A smaller block size allows a finer-grain workload distribution and decreases thread idling, especially when the number of threads is larger than the number of blocks in phases 2 and 3. Another reason to decrease the block size is that the available L2 space is now shared among the threads of a tile, contrary to the single-threaded case. In particular, BS = 64 was found to work best.
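For reference, FW performs one addition and one comparison per innermost iteration, so the customary operation count is 2N³; assuming the paper follows this convention (the excerpt does not state it explicitly), the reported performance corresponds to

\[ \mathrm{GFLOPS} = \frac{2N^{3}}{t \cdot 10^{9}}, \]

where t is the execution time in seconds.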
Figure 7 illustrates the performance evolution varying the workload and the MCDRAM exploitation for the different vectorization approaches. For small workloads (N = 8192), the improvement from the MCDRAM is modest. However, the MCDRAM presents remarkable speedups for larger workloads, even when the dataset largely exceeds the MCDRAM size (N = 65536). In particular, MCDRAM exploitation achieves an average speedup of 9.8× and a maximum speedup of 15.5×. In this way, we can see that MCDRAM usage is an efficient strategy for bandwidth-sensitive applications.

In relation to the vectorization approach, we can appreciate that guided vectorization leads to slightly better performance than the intrinsic counterpart, running up to 1.03× faster. The best performances are 330 and 338 GFLOPS for the intrinsic and guided versions, respectively. After analyzing the assembly code, we realized that this difference is caused by the prefetching instructions introduced by the compiler when guided vectorization is used; unfortunately, the compiler disables automatic prefetching when the code is manually vectorized.

6 Related Work

Despite its recent commercialization, there are already some works that evaluate KNL processors. In that sense, we highlight [18], which presents a study of the performance differences observed when using the three available MCDRAM configurations in combination with the three possible memory access (cluster) modes. Barnes et al. [3] discussed the lessons learned from optimizing a number of different high-performance applications and kernels. Haidar et al. [9] proposed and evaluated several optimization techniques for different matrix factorization methods on many-core systems.

Obtaining high performance in graph algorithms is usually a difficult task, since they tend to suffer from irregular dependencies and large space requirements. Regarding the FW algorithm, many works have been proposed to solve the all-pairs shortest paths problem on different hardware architectures; however, to the best of the authors' knowledge, there are no related works on KNL processors. Han and Kang [10] demonstrated that exploiting SSE2 instructions led to 2.3×-5.2× speedups over a blocked version. Bondhugula et al. [4] proposed a tiled parallel implementation using Field Programmable Gate Arrays. In the field of GPUs, we highlight the work of Katz and Kider [13], who proposed a shared-memory cache-efficient implementation to handle graph sizes that are inherently larger than the DRAM memory available on the device. Matsumoto et al. [15] presented a blocked algorithm for hybrid CPU-GPU systems.

7 Conclusions and Future Work

KNL is the second generation of the Xeon Phi family and features new technologies in SIMD execution and memory access. In this paper, we have evaluated a set of programming and optimization techniques for these processors, taking the FW algorithm as a representative case study of graph and memory-bound applications. Among the main contributions of this research, we can summarize:

• The blocking technique not only improved performance but also allowed us to apply a coarse-grain workload distribution in the parallel implementation.
• SIMD exploitation was crucial to achieve top performance. In particular, the serial version ran 2.9×, 6× and 15.5× faster using the SSE, AVX2 and AVX-512 extensions, respectively.
• Aligning memory accesses and loop unrolling also provided significant speedups.
• A single thread per core was enough to reach maximal performance. In addition, scatter and balanced affinities provided extra performance.
• Besides keeping portability, guided vectorization led to slightly better performance than the intrinsic counterpart, running up to 1.03× faster.
• MCDRAM usage proved to be an efficient strategy to tolerate high bandwidth demands with practically no programmer intervention, even when the dataset largely exceeded the MCDRAM size. In particular, it produced an average speedup of 9.8× and a maximum speedup of 15.5×.

As future work, we plan to evaluate programming and optimization techniques under other cluster and memory modes as a way to extract more performance.
Acknowledgments
The authors thank the ArTeCS Group from Universidad Complutense de Madrid for letting us use their Xeon Phi KNL system.
References

[1] Green500 Supercomputer Ranking
[2] Top500 Supercomputer Ranking
[3] Barnes, T., et al.: Evaluating and Optimizing the NERSC Workload on Knights Landing. In: Proceedings of the 7th International Workshop on Performance Modeling, Benchmarking and Simulation of High Performance Computing Systems, pp. 43-53. PMBS '16, IEEE Press, Piscataway, NJ, USA (2016)
[4] Bondhugula, U., Devulapalli, A., Dinan, J., Fernando, J., Wyckoff, P., Stahlberg, E., Sadayappan, P.: Hardware/software integration for FPGA-based all-pairs shortest-paths. In: 2006 14th Annual IEEE Symposium on Field-Programmable Custom Computing Machines, pp. 152-164 (April 2006)
[5] Codreanu, V., Rodríguez, J., Saastad, O.W.: Best Practice Guide - Knights Landing (2017)
[6] Culler, D.E., Gupta, A., Singh, J.P.: Parallel Computer Architecture: A Hardware/Software Approach. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 1st edn. (1997)
[7] Floyd, R.W.: Algorithm 97: Shortest path. Commun. ACM 5(6), 345 (Jun 1962)
[8] Giles, M.B., Reguly, I.: Trends in high-performance computing for engineering calculations. Philosophical Transactions of the Royal Society of London A: Mathematical, Physical and Engineering Sciences 372(2022) (2014)
[9] Haidar, A., Tomov, S., Arturov, K., Guney, M., Story, S., Dongarra, J.: LU, QR, and Cholesky factorizations: Programming model, performance analysis and optimization techniques for the Intel Knights Landing Xeon Phi. In: 2016 IEEE High Performance Extreme Computing Conference (HPEC), pp. 1-7 (Sept 2016)