Parallelware Tools: An Experimental Evaluation on POWER Systems
Manuel Arenaz (University of A Coruña and Appentra Solutions, Spain; [email protected]) and Xavier Martorell (Computer Architecture Dept., Universitat Politècnica de Catalunya, and Computer Sciences Dept., Barcelona Supercomputing Center, Spain; [email protected])
Abstract.
Static code analysis tools are designed to help software developers build better quality software in less time by detecting defects early in the software development life cycle. Even the most experienced developer regularly introduces coding defects. Identifying, mitigating and resolving defects is an essential part of the software development process, but defects can frequently go undetected. A single defect can lead to a minor malfunction or cause serious security and safety issues. This is magnified in the development of the complex parallel software required to exploit modern heterogeneous multicore hardware. Thus, there is an urgent need for new static code analysis tools that help build better concurrent and parallel software. This paper reports preliminary results on the use of Appentra's Parallelware technology to address this problem from three perspectives: finding concurrency issues in the code, discovering new opportunities for parallelization in the code, and generating parallel-equivalent code that enables tasks to run faster. The paper also presents experimental results using well-known scientific codes on POWER systems.
Keywords: static code analysis, quality assurance and testing, detection of software defects, concurrency and parallelism, Parallelware tools, OpenMP, tasking, POWER systems
Introduction

Static code analysis tools are highly specialized to detect one or more defects, typically categorized into similar types of defects. These tools fulfill a group of specific needs of software developers. It is only recently that heterogeneous multicore systems have been adopted in a wide range of hardware in industrial sectors such as automotive, wireless communication and embedded vision. It is therefore increasingly important to develop new static code analyses that address the fundamental problem of concurrency: many tasks running at the same time on the same hardware can lead to unpredictable and incorrect behaviour. Identifying and fixing issues related to concurrency and parallelism is one of the most time-consuming and costly aspects of parallel programming. However, static code analysis tools that detect defects related to parallel programming are at a very early stage.

This paper presents an experimental evaluation of Appentra's Parallelware static code analysis tools on POWER systems. These tools go beyond the state of the art by addressing the problem of concurrency and parallelism from three different perspectives: finding concurrency issues in the code, discovering new opportunities for parallelization in the code, and generating parallel-equivalent code that enables tasks to run faster. In the rest of the paper, Section 2 describes the current set of Parallelware tools, namely the Parallelware development library, Parallelware Analyzer (BETA) and Parallelware Trainer. Next, Section 3 presents early results from the analysis of the SNU NPB Suite [6], a C version of the NAS Parallel Benchmarks [5], using POWER systems available at the Jülich Supercomputing Centre and at Appentra headquarters. Finally, Section 4 presents conclusions and future work.
Parallelware tools

Appentra is a global deep-tech company that delivers products based on the Parallelware technology [4,1], a unique approach to static code analysis for concurrent and parallel programming. It is based on an engine for the detection of parallel patterns such as forall, scalar reduction, sparse forall and sparse reduction. These patterns are used to detect software issues related to concurrency and parallelism, discover parallelism, and generate parallel-equivalent code. The current portfolio of tools based on the Parallelware technology is as follows:

– Parallelware development library, which offers the static code analysis capabilities of the Parallelware technology. It provides an Application Programming Interface (API) that is the basis of Parallelware Analyzer and Parallelware Trainer, and that is designed to enable integration into third-party software development tools. It supports the C programming language, the OpenMP 4.5 [8] and OpenACC 2.5 [7] directive-based parallel programming interfaces, and the multithreading, offloading and tasking parallel programming paradigms.
– Parallelware Analyzer (BETA) [3] is designed to speed up the development of parallel applications and to enforce best practice in parallel programming for heterogeneous multicore systems. It helps software developers by finding software defects early in the parallelization process and thus increases productivity, maintainability and sustainability. It is available as a set of command-line tools to enable compatibility with Continuous Integration and DevOps platforms.
– Parallelware Trainer [3] is an interactive, real-time code editor that enables scalable, interactive teaching and learning of parallel programming, increasing productivity and retention of learning. It is available for the Windows, Linux and MacOS operating systems.
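As a rough illustration of two of these patterns, the following C sketch (our own example, not code from the Parallelware API or the benchmarks) contrasts a forall loop, whose iterations are independent, with a scalar reduction, whose iterations all accumulate into a single variable:

```c
#include <stddef.h>

// Illustrative kernels (ours, not from the Parallelware API) showing
// two of the parallel patterns the detection engine recognizes.

// "forall" pattern: each iteration writes a distinct element of out[],
// so the iterations are independent and the loop is trivially parallel.
void forall_scale(int n, const double *in, double *out, double k) {
    for (int i = 0; i < n; i++)
        out[i] = k * in[i];
}

// "scalar reduction" pattern: every iteration updates the same scalar,
// a loop-carried dependency that requires a reduction to parallelize.
double scalar_reduction_sum(int n, const double *in) {
    double sum = 0.0;
    for (int i = 0; i < n; i++)
        sum += in[i];
    return sum;
}
```

Distinguishing these two shapes matters because the safe parallelization differs: a forall needs no synchronization, while a scalar reduction needs, for example, an OpenMP reduction clause.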
Experimental results
This section presents experimental results obtained on POWER systems using the Parallelware tools. More specifically, Section 3.1 presents the report generated by Parallelware Analyzer and Section 3.2 presents experimental results of codes parallelized using Parallelware Trainer.
The report shown in Table 1 was generated by the Parallelware Analyzer tool after analyzing the codes written in the C programming language from the SNU NPB Suite [6,5] benchmarks (NPB-SER-C and NPB-OMP-C implementations). The structure of the report is as follows:
– Benchmark, the software application;
– Files, the number of source code files;
– SLOC, the source lines of code calculated by the sloccount tool;
– Time, the runtime of the Parallelware Analyzer tool in milliseconds;
– Software issues, the number of issues found in the code related to concurrency and parallelism; and
– Opportunities, the number of loops found in the code that have opportunities for parallelization using the multithreading and SIMD paradigms.

The last row of the table provides total numbers for all the analyzed benchmarks. The current tool setup reports five software issues related to concurrency and parallelism:
– Global, use of global variables in the body of a function;
– Scope, scalar variables not declared in the smallest possible scope;
– Pure, pure functions free of side effects that are not marked as such by the programmer;
– Scoping, variables in an OpenMP parallel region without an explicit data scoping; and
– Default, OpenMP parallel regions without the default(none) clause.

More information about each of them can be found on the Appentra Knowledge website [2]. The tool also reports two types of opportunities for parallelization:

– Multi, outer loops that can be parallelized with the multithreading paradigm; and
– SIMD, inner loops that can be parallelized with the SIMD paradigm.

Parallelware Analyzer successfully analyzed a total of 192 source code files, containing 39890 lines of code written in the C programming language, in less than 13 seconds. In terms of software issues related to concurrency and parallelism, the tool detected a total of 296 uses of global variables in the body of functions. There are 2082 declarations of scalars in a bigger scope than necessary. Moreover, a total of 9 pure functions that are free of side effects but not marked as such were found. Finally, 329 variables with an implicit data scoping and 117 OpenMP parallel regions without the default(none) clause were detected. In terms of opportunities for parallelization, a total of 312 outer loops and the same number of inner loops can be parallelized using the multithreading and SIMD paradigms, respectively.
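To make these checks concrete, the following fragment (a hypothetical example of ours, not code from the NPB sources) packs three of them into a few lines: a global variable mutated inside a function, a scalar declared in a wider scope than needed, and an unannotated pure function:

```c
// Hypothetical C fragment illustrating three of the reported checks;
// all identifiers are ours, not taken from the benchmarks.

double g_total;                        // "Global": global state mutated inside
                                       // a function hinders parallelization.

double square(double x) {              // "Pure": side-effect free, but not
    return x * x;                      // marked as such, so the hint is lost
}                                      // to compilers and analysis tools.

void accumulate(int n, const double *v) {
    double tmp;                        // "Scope": tmp is only used inside the
    for (int i = 0; i < n; i++) {      // loop body and should be declared
        tmp = square(v[i]);            // there, in the smallest scope.
        g_total += tmp;
    }
}
```

Each of these is harmless in sequential code, which is why such defects accumulate unnoticed; they only surface as races or missed parallelization opportunities once the code is run concurrently.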
The Parallelware Trainer tool was used to automatically generate several parallel versions of a code that computes the Mandelbrot set. Four versions of Mandelbrot are considered in this work:

– Sequential, the serial version (see Listing 1.1, ignoring the OpenMP directives);
– Multithreading, an OpenMP version using the multithreading paradigm (see Listing 1.1);
– Taskwait, a parallel version using the OpenMP 3.0 tasking paradigm (see Listing 1.2, which uses the task and taskwait directives); and
– Taskloop, a parallel version using the OpenMP 4.5 tasking paradigm (see Listing 1.3, which uses the taskloop directive).

It should be noted that a software engineer with little experience used the tool to generate and test all the parallel versions for correctness and performance in less than one hour.

Experiments were conducted on two POWER systems: a compute node of the
Juron supercomputer at the Jülich Supercomputing Centre and the Appentra server available at Appentra's headquarters. In Juron, the hardware setup of each compute node consists of an IBM S822LC system with 2x 10-core SMT8 POWER8NVL CPUs, offering a total of 160 threads. It provides a CentOS Linux 7 (AltArch) operating system with a GCC 4.8.5 compiler. In the Appentra server, the hardware setup consists of a RaptorCS Talos II system equipped with an 8-core SMT4 POWER9 processor, offering a total of 32 threads. It runs a Debian 10 (buster) Linux with a GCC 8.3.0 compiler. In both systems, the GCC compiler flags were as follows: -O2 for sequential execution and -fopenmp -O2 for OpenMP-enabled parallel execution.

Table 2 shows the runtimes and speedups for a problem size of 20000. Its structure is as follows:

– Version, the serial or parallel version of the code, one of Sequential, Multithreading, Taskwait and Taskloop;
– No. Threads, the number of OpenMP threads;
– Time, the runtime in seconds; and
– Speedup, the speedup calculated with respect to the sequential version, for each POWER system.

The Taskwait version is the fastest code both in Juron (maximum speedup of 56 for 160 threads) and on Appentra's POWER9 server (maximum speedup above 28 for 32 and 64 threads). The Multithreading version is also fast, but its speedup is below Taskwait because the OpenMP code generated by Parallelware Trainer includes the clause schedule(auto), which defaults to schedule(static). Note that since the workload of Mandelbrot is not constant, different threads are assigned different amounts of work. Therefore, schedule(static) is not the best choice and should be replaced by schedule(static,1) or schedule(dynamic). Finally, note that the Taskloop version does not scale with the number of threads. This needs to be further investigated, as we expected Taskloop to also decrease the execution time on both systems.
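The scheduling fix suggested above can be sketched as follows. This is our own minimal example of a load-imbalanced loop (the function and its cost[] workload model are hypothetical, not the generated Mandelbrot code): with schedule(dynamic), rows are handed to threads on demand, so idle threads pick up the remaining expensive rows instead of waiting on a fixed static partition.

```c
// Minimal sketch (ours, not Parallelware Trainer output) of the scheduling
// fix discussed above. Per-row work varies with cost[row], mimicking the
// uneven per-pixel iteration counts of Mandelbrot rows.
void process_rows(int height, int width, const int *cost, long *done) {
    #pragma omp parallel for schedule(dynamic)
    for (int row = 0; row < height; row++) {
        long work = 0;
        for (int c = 0; c < width * cost[row]; c++)  // cost[] models the
            work++;                                  // varying row workload
        done[row] = work;
    }
}
```

Compile with -fopenmp to enable the directive; without it the pragma is ignored and the loop runs sequentially with identical results.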
Conclusions and future work

Preliminary results show evidence that the Parallelware tools have the potential to help software developers build better quality parallel code. On the one hand, Parallelware Analyzer was used to evaluate the SNU NPB Suite, a C implementation of the NAS Parallel Benchmarks. The static code analysis capabilities of the Parallelware technology reported the existence of data scoping issues in the codes, as well as the existence of pure functions which were not marked as such to provide additional hints to the compiler. Additionally, the tool also reported the existence of sequential loops that could be parallelized using the multithreading and SIMD paradigms.

On the other hand, Parallelware Trainer provides a GUI that facilitates the generation of parallel versions of a code, as well as the testing of those versions for correctness and performance. In less than one hour, a software engineer with little experience in parallel programming generated several OpenMP-enabled parallel versions of the Mandelbrot algorithm using the multithreading and tasking paradigms. Performance tests showed significant speedups on both the Juron and Appentra POWER systems.

As future work, we plan to further develop the Parallelware tools to support C++ and Fortran, as well as other task-based parallel versions tuned for execution on GPUs and FPGAs. We also plan to extend the number of software issues related to concurrency and parallelism detected by the Parallelware tools and to run them on a wider set of scientific and engineering software.
Acknowledgements
This work has been partly funded by the Spanish Ministry of Science and Technology (TIN2015-65316-P), the Departament d'Innovació, Universitats i Empresa de la Generalitat de Catalunya (MPEXPAR: Models de Programació i Entorns d'Execució Paral·lels, 2014-SGR-1051), and the European Union's Horizon 2020 research and innovation programme through grant agreements MAESTRO (801101) and EPEEC (801051). The authors gratefully acknowledge access to the Juron system at the Jülich Supercomputing Centre.
References
1. J. Andión, M. Arenaz, G. Rodríguez, and J. Touriño. A Novel Compiler Support for Automatic Parallelization on Multicore Systems. Parallel Computing, 39(9):442-460, 2013.
2. Appentra. Defects and Recommendations for Concurrency and Parallelism. 2019.
3. Appentra. Parallelware tools. 2019.
4. M. Arenaz, J. Touriño, and R. Doallo. XARK: An Extensible Framework for Automatic Recognition of Computational Kernels. ACM Transactions on Programming Languages and Systems (TOPLAS), 30(6):32:1-32:56, 2008.
5. D. Bailey, E. Barszcz, J. Barton, D. Browning, R. Carter, L. Dagum, R. Fatoohi, P. Frederickson, T. Lasinski, R. Schreiber, H. Simon, V. Venkatakrishnan, and S. Weeratunga. The NAS Parallel Benchmarks - Summary and Preliminary Results. In Proceedings of the 1991 ACM/IEEE Conference on Supercomputing, Supercomputing '91, pages 158-165. ACM, 1991.
6. Center for Manycore Programming, Seoul National University (SNU). SNU NPB Suite. http://aces.snu.ac.kr/software/snu-npb/, 2013.
7. OpenACC Architecture Review Board. The OpenACC Application Programming Interface, Version 2.5. Oct. 2015.
8. OpenMP Architecture Review Board. OpenMP Application Program Interface, Version 4.5. Nov. 2015.

                                                Software issues                       Opportunities
Benchmark              Files   SLOC  Time (ms)  Global  Scope  Pure  Scoping  Default  Multi  SIMD
NPB3.3-SER-C/BT           17   2608     557.97      13    143     0        0        0     24    44
NPB3.3-SER-C/CG            3    521     143.69       3     20     1        0        0     13    10
NPB3.3-SER-C/DC           11   2725     430.67      10      0     3        0        0     13     0
NPB3.3-SER-C/EP            2    175      88.61       1      0     0        0        0      2     0
NPB3.3-SER-C/FT            7    625     238.77       4      0     1        0        0      0     0
NPB3.3-SER-C/IS            2    463      69.30       4      0     0        0        0      4     0
NPB3.3-SER-C/LU           19   2389     739.86      15    298     0        0        0     29    59
NPB3.3-SER-C/MG            3    873     648.68      11      2     0        0        0      4     2
NPB3.3-SER-C/SP           19   2056     683.60      19    381     0        0        0     28    91
NPB3.3-SER-C/UA           13   5576    2181.73      53    163     0        0        0     77    69
NPB3.3-SER-C/common        0    296     174.19       0      0     0        0        0      0     0
NPB3.3-SER-C/config        0      0      28.71       0      0     0        0        0      0     0
NPB3.3-SER-C/sys           1    759     182.71       2      0     0        0        0      0     0
NPB3.3-SER-C              97  19066    6168.48     135   1007     5        0        0    194   275
NPB3.3-OMP-C/BT           17   2693     568.03      13    144     0       41        9      8     4
NPB3.3-OMP-C/CG            3    627     171.77       5     20     1       16        5      9     3
NPB3.3-OMP-C/DC           11   2754     425.05      10      0     3        0        0     13     0
NPB3.3-OMP-C/EP            2    198      92.40       1      0     0        4        3      0     0
NPB3.3-OMP-C/FT            3    649     163.39      12      0     0       10        8      0     0
NPB3.3-OMP-C/IS            2    634      88.15       6      4     0        7        4      4     0
NPB3.3-OMP-C/LU           20   2542     778.31      17    295     0       55        9      5     0
NPB3.3-OMP-C/MG            3    923     662.07      11      2     0       19       10      4     2
NPB3.3-OMP-C/SP           19   2147     693.33      19    381     0       45       13      8     4
NPB3.3-OMP-C/UA           14   6549    2749.03      65    229     0      132       56     67    24
NPB3.3-OMP-C/bin           0      0      29.25       0      0     0        0        0      0     0
NPB3.3-OMP-C/common        0    349     178.34       0      0     0        0        0      0     0
NPB3.3-OMP-C/config        0      0      28.47       0      0     0        0        0      0     0
NPB3.3-OMP-C/sys           1    759     179.29       2      0     0        0        0      0     0
NPB3.3-OMP                95  20824    6806.88     161   1075     4      329      117    118    37
Totals                   192  39890   12975.36     296   2082     9      329      117    312   312
Table 1.
Parallelware Analyzer report.

                              Juron                     Appentra's server
Version          No. Threads  Time (secs)  Speedup      Time (secs)  Speedup
Sequential                 4        89.50     1              178.92     1
Multithreading             4        32.85     2.72            37.94     4.72
Taskwait                   4        23.30     3.84            24.38     7.34
Taskloop                   4       133.31     0.67            37.99     4.71
Sequential                 8        89.52     1              178.92     1
Multithreading             8        31.77     2.82            38.96     4.59
Taskwait                   8        17.91     4.99            21.57     8.29
Taskloop                   8       143.22     0.63            38.82     4.61
Sequential                16        89.51     1              178.93     1
Multithreading            16        14.42     6.21            20.96     8.54
Taskwait                  16         7.67    11.67            12.02    14.89
Taskloop                  16        86.44     1.04            20.99     8.53
Sequential                32        89.51     1              178.93     1
Multithreading            32         8.57    10.45            10.93    16.37
Taskwait                  32         4.97    19.01             6.31    28.36
Taskloop                  32        99.93     0.89            11.05    16.19
Sequential                64        89.52     1              178.92     1
Multithreading            64         4.24    21.11             7.80    22.94
Taskwait                  64         2.60    34.43             6.33    28.27
Taskloop                  64        86.45     1.04             7.70    23.24
Sequential                80        89.53     1                 -       -
Multithreading            80         3.50    25.58              -       -
Taskwait                  80         2.34    38.26              -       -
Taskloop                  80        86.45     1.04              -       -
Sequential               128        89.51     1                 -       -
Multithreading           128         2.59    34.56              -       -
Taskwait                 128         1.64    54.58              -       -
Taskloop                 128        86.46     1.04              -       -
Sequential               160        89.53     1                 -       -
Multithreading           160         2.53    35.39              -       -
Taskwait                 160         1.60    55.96              -       -
Taskloop                 160        86.42     1.04              -       -
Table 2.
Execution times (in seconds) and speedups of Mandelbrot in Juron (2x 10-core SMT8 POWER8 processors) and in Appentra's POWER server (8-core SMT4 POWER9) for a problem size of 20000.

Listing 1.1. OpenMP-enabled parallel version of Mandelbrot using the multithreading paradigm. Parallel code automatically generated by Parallelware Trainer.

    int mandelbrot(int max_iter, int height, int width,
                   double **output, double real_min, double real_max,
                   double imag_min, double imag_max) {
        double scale_real = (real_max - real_min) / width;
        double scale_imag = (imag_max - imag_min) / height;
        #pragma omp parallel
        {
            #pragma omp for schedule(auto)
            for (int row = 0; row < height; row++) {
                for (int col = 0; col < width; col++) {
                    double x0 = real_min + col * scale_real;
                    double y0 = imag_min + row * scale_imag;
                    double y = 0, x = 0;
                    int iter = 0;
                    while (x * x + y * y < 4 && iter < max_iter) {
                        double xtemp = x * x - y * y + x0;
                        y = 2 * x * y + y0;
                        x = xtemp;
                        iter++;
                    }
                    output[row][col] = iter;
                }
            }
        } // end parallel
        return 0;
    }

Listing 1.2. OpenMP-enabled parallel version of Mandelbrot using the tasking paradigm of OpenMP version 3.0 (task/taskwait). Parallel code automatically generated by Parallelware Trainer.

    int mandelbrot(int max_iter, int height, int width,
                   double **output, double real_min, double real_max,
                   double imag_min, double imag_max) {
        double scale_real = (real_max - real_min) / width;
        double scale_imag = (imag_max - imag_min) / height;
        #pragma omp parallel
        #pragma omp single
        {
            for (int row = 0; row < height; row++) {
                #pragma omp task
                {
                    for (int col = 0; col < width; col++) {
                        // ... same per-pixel computation as in Listing 1.1 ...
                        output[row][col] = iter;
                    }
                } // end task
            }
            #pragma omp taskwait
        } // end parallel
        return 0;
    }

Listing 1.3. OpenMP-enabled parallel version of Mandelbrot using the tasking paradigm of OpenMP version 4.5 (taskloop). Parallel code automatically generated by Parallelware Trainer.

    int mandelbrot(int max_iter, int height, int width,
                   double **output, double real_min, double real_max,
                   double imag_min, double imag_max) {
        double scale_real = (real_max - real_min) / width;
        double scale_imag = (imag_max - imag_min) / height;
        #pragma omp parallel
        #pragma omp single
        {
            #pragma omp taskloop
            for (int row = 0; row < height; row++) {
                for (int col = 0; col < width; col++) {
                    // ... same per-pixel computation as in Listing 1.1 ...
                    output[row][col] = iter;
                }
            }
        } // end parallel
        return 0;
    }